Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

To cite this case study please use:

Wright, Carrie, and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-youth-mental-health-case-study. Mental Health of American Youth (Version v1.0.0).

Motivation


Rates of depression appear to have been increasing among American youths since around 2010 according to a recent report. A recent study) also shows that youths appear to be seeking more care from mental health services.

This case study will explore how rates of major depressive episodes have changed since the early 2000s. As well as how rates of treatment for depression of youths have changed over time.

Photo by K. Mitch Hodge on Unsplash

The major symptoms of a major depressive episode include:

Sleep disorder (increased or decreased)
Interest deficit (anhedonia)
Guilt (worthlessness, hopelessness, regret)
Energy deficit
Concentration deficit
Appetite disorder (increased or decreased)
Psychomotor retardation or agitation
Suicidality

[source]

Click here to see the diagnostic requirements for a major depressive episode according to the DSM 5

Five or more of the following symptoms have been present and documented during the same two-week period and represent a change from previous functioning; at least one of the symptoms is either (1) depressed mood or (2) loss of interest or pleasure.

Note: Do not include symptoms that are clearly attributable to another medical condition.

  1. Depressed mood most of the day, nearly every day, as indicated by either subjective report (e.g., feels sad, empty, hopeless) or observation made by others (e.g., appears tearful)

  2. Markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day (as indicated by either subjective account or observation)

  3. Significant weight loss when not dieting or weight gain (e.g., a change of more than 5% of body weight in a month), or decrease or increase in appetite nearly every day

  4. Insomnia or hypersomnia nearly every day

  5. Psychomotor agitation or retardation nearly every day (observable by others, not merely subjective feelings of restlessness or being slowed down)

  6. Fatigue or loss of energy nearly every day

  7. Feelings of worthlessness or excessive or inappropriate guilt (which may be delusional) nearly every day (not merely self-reproach or guilt about being sick)

  8. Diminished ability to think or concentrate, or indecisiveness, nearly every day (either by subjective account or as observed by others)

  9. Recurrent thoughts of death (not just fear of dying), recurrent suicidal ideation without a specific plan, or a suicide attempt or a specific plan for committing suicide

B. The symptoms do not meet criteria for a mixed episode.

C. The episode is not attributable to the physiological effects of a substance or to another medical condition.

Note: Criteria A-C represent a major depressive episode.

Note: Responses to a significant loss (e.g., bereavement, financial ruin, losses from a natural disaster, a serious medical illness or disability) may include feelings of intense sadness, rumination about the loss, insomnia, poor appetite and weight loss noted in Criterion A, which may resemble a depressive episode. Although such symptoms may be understandable or considered appropriate to the loss, the presence of a major depressive episode in addition to the normal response to a significant loss should also be carefully considered. This decision inevitably requires the exercise of clinical judgment based on the individual’s history of and the cultural norms for the expression of distress in the context of loss.

D. The occurrence of the major depressive episode is not better explained by schizoaffective disorder, schizophrenia, schizophreniform disorder, delusional disorder, or other specified and unspecified schizophrenia spectrum and other psychotic disorders.

E. There has never been a manic episode or a hypomanic episode.

Note: This exclusion does not apply if all of the manic-like or hypomanic-like episodes are substance-induced or are attributable to the physiological effects of another medical condition.

This case study is motivated by the following two articles:

Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. J Abnorm Psychol.128,3 (2019):185-199. doi:10.1037/abn0000410

Olfson, M., Blanco, C., Wang, S., Laje, G. & Correll, C. U. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. JAMA Psychiatry. 71, 81 (2014):81-90. doi: 10.1001/jamapsychiatry.2013.3074.

The main findings of the first article are:

Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25.

Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over.

Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.

While the main findings of the second article are:

Compared with adult mental health care, the mental health care of young people has increased more rapidly.

Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses per 100 population increased significantly faster for youths (from 7.78 to 15.30 visits) than for adults (from 23.23 to 28.48 visits) (interaction: P < .001).

Psychiatrist visits also increased significantly faster for youths (from 2.86 to 5.71 visits).

Again while depression appear to be on the rise for youths, youths also appear to be seeking more mental health care.

In this case study we will be using data from the National Survey on Drug Use and Health (NSDUH) related to treatment and major depressive episode rate to explore how this have changed over time and how different groups compare. This data was also used in the first referenced article.

Main Questions


Our main questions:

  1. How have depression rates in American youth changed since 2004, according to the NSDUH data? How have rates differed between different youth subgroups (age, gender, ethnicity)?
  2. Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?

Learning Objectives


Statistical Learning Objectives:

  1. Define and create a contingency table.
  2. Implementation of a chi-squared test for independence.
  3. Interpretation of a chi-squared test for independence.

Data science Learning Objectives:

  1. Scrape data directly from a website (rvest).
  2. Subset and filter data (dplyr).
  3. Write functions to wrangle data repetitively.
  4. Work with character strings (stringr).
  5. Reshape data into different formats (tidyr).
  6. Data visualizations (ggplot2) with labels (directlabels) and facets for different groups.

In this case study, we will especially focus on using packages and functions from the Tidyverse, such as rvest. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


We will begin by loading the packages that we will need:

Package Use
here to easily load and save data
rvest to scrape web pages
dplyr to scrape web pages
magrittr to scrape web pages
stringr to manipulate strings
tidyr to change the shape or format of tibbles to wide and long
tibble to create tibbles and convert values of a column to row names
ggplot2 to create plots
directlabels to add labels directly to lines in plots
forcats to reorder factor for plot
cowplot to combine plots together

The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.

Context


According to the CDC the rate of suicide has also increased for most age groups in the United States over the past decade and a half.

[source]

While suicide does appear to be increasing among youths it also appears to be increasing among middle aged adults for both females and males.

[source]

According to the CDC:

Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States. In 2016, suicide became the second leading cause of death among those aged 10–34 and the fourth leading cause among those aged 35–54.

[source]

So although suicide is on the rise for most age groups, suicide is one of the top two contributors to death for youths.

Thus this warrants further examination of the mental health of American youths.

[source]

Historically, suicide rates were much higher before 1950, however, we are seeing an increase in the last 20 years.

[source]

Besides the US, other countries are also experiencing increased rates of depression in youths. See this report from the World Health Organization about rates of depression in other countries.

See here for an interesting discussion about what may be causing increased depression rates.

According to the National Institute of Mental Health (NIMH):

If you are in crisis and need help, call this toll-free number for the National Suicide Prevention Lifeline (NSPL), available 24 hours a day, every day: 1-800-273-TALK (8255). The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at www.suicidepreventionlifeline.org.

The Crisis Text Line is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to 741741 and a trained crisis counselor will respond to you with support and information over text message. Visit www.crisistextline.org.

Also see here for more information about how to recognize and help youths experiencing symptoms of depression.

Limitations


avocado -Michael:Perhaps “underestimates in the p-values…” is not the correct way to phrase this. I would look for a better way to word this.-Carrie:This is my attempt after Michael’s… open to changes!

There are some important considerations regarding this data analysis to keep in mind:

  1. The data that we will use come from a survey and are therefore values from a sample that estimate that of the true population. In our statistical analysis we use these sample values as if they are population estimates (because this is all we have access to). Thus our results are not necessarily indicative of true differences.

  2. Furthermore, the sampling mechanism utilized can introduce selection bias in cases where the the sampling methods do not produce a representative sample.

  3. Data is collected from human participants; this presents the potential for information bias, as there is the potential that participants in the sampling frame may for a variety of reasons report inaccurate information.

What are the data?


We will be using data from the National Survey on Drug Use and Health (NSDUH) which is directed by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS).

This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia. Approximately 70,000 people (age 12 and up) are interviewed each year about health related issues. Only civilian, non-institutionalized individuals are included. Households are randomly selected and than a professional interviewer visits the addresses and asks one or two of the residents to interview. The interviewer brings a laptop with them that the participants use to fill out the survey which typically takes an hour to complete. If a participant chooses to participate they receive $30 in cash. All collected information is confidential and is used for disease surveillance and to guide public policy particularly focused on drug and alcohol use as well as mental health. See here for more details about the survey.

This data is made available publicly online on the Substance Abuse & Mental Health Data Archive.

At the website for the survey data, you can see that the results are displayed in many tables. Importantly, there is no obvious way to download the data directly from this particular website.

If one clicks on the TOC button on the far right upper corner they will be directed to another website, where a large pdf document containing of all of the results can be downloaded.

We are interested in investigating how depression rates have changed and how youths are interacting with mental health services. Thus the following tables are of interest to us are:

Table Details
Table 11.1A Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018
Table 11.1B Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018
Table 11.2A Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.2B Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018
Table 11.3A Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018
Table 11.3B Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018
Table 11.4A Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.4B Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018

According to the NSDUH 2018 report

Respondents were defined as having had an MDE in the past 12 months if they had at least one period of 2 weeks or longer in the past year when they experienced a depressed mood or loss of interest or pleasure in daily activities, accompanied by problems with sleeping, eating, energy, concentration, or self-worth. The MDE questions are based on diagnostic criteria from DSM-5. Some of the wordings of the depression questions for adolescents aged 12 to 17 and adults aged 18 or older differed slightly to make the questions more developmentally appropriate for adolescents.

Adolescents were defined as having an MDE with severe impairment if their depression caused severe problems with their ability to do chores at home, do well at work or school, get along with their family, or have a social life.

Data Import


Data is often made available online. Usually, the data we are interested in is made available for download on the page as a delimited text file or an excel file. However, sometimes data is not made available in this manner, such as the NSDUH survey data.

How do we proceed in this scenario?

We can manually copy each cell of data, however, this process is often inefficient, subject to error, and not reproducible. Say we wanted to run an analysis next year on the next years data and it happens to be formatted in the same way.

We can also use R for web scraping.

Web scraping is the process of extracting data from a website.

Basic steps of web scraping

There are two main steps to web scraping:

  1. Identify location of data on the webpage that will be scraped

  2. Save the webpage element to an object

We accomplish STEP 1 with our web browser.

We accomplish STEP 2 in the R programming environment.

Additional resources for web scraping:

In this case study we will scrape data from the tables on the NSDUH survey website. This data is available in a large PDF with all the results form the year. However it is not easy to find this PDF and it would be difficult and time consuming to find our tables of interest and to extract the data from the pdf with pdftools. Again, if we instead decided to copy paste the data from the website to another file that we would also need to import, this would not be as efficient or reproducible and might result in errors.

Alternatively, we will use the rvest package to scrape the data directly from the tables on the website. Assuming the data next year would be displayed in a similar manner, this could allow us simply modify our code based on the url for the data next year to run the same analysis on the data easily.

The rvest package can be thought of as the pdftools package for web scraping. Upon pulling the data, additional wrangling will likely be required; but like the pdftools package, rvest streamlines the extraction process.

Steps for scraping tables

The two web scraping steps for these tables can be broken down even further:

  1. Identify location of data that will be scraped
  • right-click to inspect element (webpage)
  • hover pointer over components of element (webpage) until the data has been found
  • copy XPath of data sought
  1. Save webpage element to an object in R
  • import html code for the webpage
  • extract pieces of HTML documents (webpage) using XPath
  • parse the extracted data into a data frame

Below is a animated overview of the process.


Now let’s go through each step together:

1) Identify location of data that will be scraped

First, let’s go to the web page with all the tables we are interested in scraping

Once on the webpage, there aren’t any visible options to download the data.

Right-click and select “Inspect”

A window opens.

This window allows us to glance at the internal mechanics of the webpage. To scrape the data from the webpage, we need to first learn a little bit about the components that make it the web page it is.

Hovering our mouse over the elements of the webpage highlights the respective section of the webpage it represents. By hovering over several elements—and clicking on the elements on the right side of the screen—we can identify the element that contains the data we are looking for. Another option for identifying XPaths is to use the selectorgadget tool.

Right click on the element and copy the XPath. We will need this XPath for the next step.

Now we can return to the R programming environment.


2) Save webpage element to an object in R

For the first table we want to scrape, the XPath is /html/body/div[4]/div[1]/table. We use this XPath with functions from the rvest package to scrape the data from this table.

Let’s explore this step in greater detail:

We need to:

  • import html code for the webpage
  • extract pieces (table) out of HTML documents (webpage) using XPath
  • parse the html table into a data frame

To do this:

  • We import the html code using the read_html() function of the rvest package
  • We extract specific components of the webpage using the html_nodes() function of the rvest package
  • We convert this html table into a dataframe using the html_table()function of the rvest package

The rvest package provides wrappers for the xml2 and httr packages, thus we can just install and load the rvest package and it will install and load dependency packages like xml2 and httr and allow us to use functions from both of these packages.

In fact, when we load rvest the first time we see:

In this case, we are scraping table 11.1a from the website. First we assign the url to a character string to use within the read_html() function of the xml2 package.

One could also directly use the url but this is less convenient for piping.

Click here if you are unfamiliar with piping in R, which uses this %>% operator

By piping we mean using the %>% pipe operator which are usable when loading the tidyverse or several of the packages within the tidyverse like dplyr because they load the magrittr package. This allows us to perform multiple sequential steps on one data input.

The read_html() function then allows us to save the html document for the webpage inside R.

{html_document}
<html lang="en">
[1] <head>\n<!-- Google Tag Manager --><script>(function(w,d,s,l,i){w[l]=w[l] ...
[2] <body>\r\n<!-- Google Tag Manager (noscript) -->\r\n<noscript><iframe src ...

Then we use the html_nodes() function of the rvest package to select just the table11.1a element of the webpage.

See this tutorial (and the answers in case you get stuck) on CSS selectors to understand more about how this function works to use the xpath to select the elements of interest from the webpage.

{xml_nodeset (1)}
[1] <table class="rti">\n<caption>Table 11.1A – Settings Where Mental Health  ...

Finally, the html_table() function of the rvest package parses the html object into a data frame.

[[1]]
     Setting Where Mental Health ServiceWas Received 2002 2003 2004 2005 2006
     2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
 [ reached 'max' / getOption("max.print") -- omitted 21 rows ]
List of 1
 $ :'data.frame':   21 obs. of  18 variables:
  ..$ Setting Where Mental Health ServiceWas Received: chr [1:21] "SPECIALTY MENTAL HEALTH SERVICE1" "Outpatient" "Private Therapist, Psychologist,\r\n   Psychiatrist, Social Worker, or\r\n   Counselor" "Mental Health Clinic or Center" ...
  ..$ 2002                                           : chr [1:21] "2,898a" "2,662a" "2,254a" "611a" ...
  ..$ 2003                                           : chr [1:21] "3,065a" "2,795a" "2,347a" "635a" ...
  ..$ 2004                                           : chr [1:21] "3,348a" "3,015a" "2,523a" "716a" ...
  ..$ 2005                                           : chr [1:21] "3,362a" "3,048a" "2,573a" "657a" ...
  ..$ 2006                                           : chr [1:21] "3,255a" "2,931a" "2,416a" "587a" ...
  ..$ 2007                                           : chr [1:21] "3,104a" "2,787a" "2,365a" "583a" ...
  ..$ 2008                                           : chr [1:21] "3,129a" "2,837a" "2,408a" "567a" ...
  ..$ 2009                                           : chr [1:21] "2,925a" "2,650a" "2,296a" "537a" ...
  ..$ 2010                                           : chr [1:21] "2,920a" "2,635a" "2,265a" "547a" ...
  ..$ 2011                                           : chr [1:21] "3,101a" "2,842a" "2,409a" "547a" ...
  ..$ 2012                                           : chr [1:21] "3,118a" "2,846a" "2,427a" "610a" ...
  ..$ 2013                                           : chr [1:21] "3,341a" "3,064a" "2,572a" "731a" ...
  ..$ 2014                                           : chr [1:21] "3,369a" "3,110a" "2,698a" "760a" ...
  ..$ 2015                                           : chr [1:21] "3,253a" "2,958a" "2,532a" "792a" ...
  ..$ 2016                                           : chr [1:21] "3,598a" "3,239a" "2,819a" "929" ...
  ..$ 2017                                           : chr [1:21] "3,646a" "3,328" "2,908" "995" ...
  ..$ 2018                                           : chr [1:21] "3,901" "3,547" "3,080" "977" ...

We can see that the output is a list with one element, to extract the data from the list we will use brackets [[]] to select the first element of the list.

Putting this all of this together we can do the entire process like this with our pipe operator %>%.

Now need to repeat the above process for the other tables we are interested in.

Writing a function to scrape multiple tables

We can create a function to accomplish this succinctly. Functions allow us to perform the same process on multiple data inputs. See this other case study for more details about how to write a function.

In general, the process pf writing functions involves first specifying an input that is used within the function to create an output. In this case the data input is XPATH which will be replaced by an actual XPath and then used in the subsequent steps to scrape the data from each table that an XPath is supplied for.

We will all this function scraper.

Now we can apply the function we created to each of the XPaths for each of the tables on the website that we would like to use in our data analysis.

Great! We have successfully scraped the data.

Now we need to wrangle the data.

Data Exploration and Wrangling


Now that we’ve imported the data, let’s see if we can wrangle a table. Since the data appears to be formatted in a similar way in each of the tables, it is likely that whatever steps we take to wrangle this first table will also be necessary in the wrangling of subsequent tables. This is because well-maintained data sources often format different datasets similarly. We can take advantage of this similarity to speed up the wrangling process.

Table11.1a

First we want to remove the last row of our data frame, which happens to be the legend of our table. Recall from looking at the website that there is a legend for this table.

We can take a look at the last row using the tail function of the dplyr package. We can specify that we only want to see the last row by using the n = 1 argument.

# A tibble: 1 x 18
  `Setting Where … `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
  <chr>            <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 "* = low precis… "* = … "* = … "* = … "* = … "* = … "* = … "* = … "* = …
# … with 9 more variables: `2010` <chr>, `2011` <chr>, `2012` <chr>,
#   `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
#   `2018` <chr>

We can see that the legend is repeated for every column. Let’s take a look at the year 2004 column.

# A tibble: 1 x 1
  `2004`                                                                        
  <chr>                                                                         
1 "* = low precision; -- = not available; da = does not apply; nc = not compara…

Let’s save this so that we can refer back to it later:

Another way to look at the last row is to use the n() function of the dplyr package. This function can be used inside other dplyr functions and it counts the total number of observations of a group. Within the slice() function of the dplyr package, it allows you to refer the full length of the object.

# A tibble: 1 x 18
  `Setting Where … `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
  <chr>            <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 "* = low precis… "* = … "* = … "* = … "* = … "* = … "* = … "* = … "* = …
# … with 9 more variables: `2010` <chr>, `2011` <chr>, `2012` <chr>,
#   `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
#   `2018` <chr>

We can use the slice() function of the dplyr package to remove this row, by using the slicefunction to select from the first row using 1: to the second to last row using n()-1.

We are also going to use a special pipe operator from the magrittr package called the compound assignment pipe-operator or sometimes the double pipe operator. This allows us to use the table11.1a as our input and reassign it at the end after all the steps have been performed.

We will also first change the data to be a tibble.), which is the tidyverse version of a data frame using the as_tibble() function of the dplyr package and the tibble package.

Now let’s take a look at data:

# A tibble: 20 x 18
   `Setting Where … `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
   <chr>            <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 "SPECIALTY MENT… "2,89… "3,06… "3,34… "3,36… "3,25… "3,10… "3,12… "2,92…
 2 "Outpatient"     "2,66… "2,79… "3,01… "3,04… "2,93… "2,78… "2,83… "2,65…
 3 "Private Therap… "2,25… "2,34… "2,52… "2,57… "2,41… "2,36… "2,40… "2,29…
 4 "Mental Health … "611a" "635a" "716a" "657a" "587a" "583a" "567a" "537a"
 5 "Partial Day Ho… "440"  "425"  "439"  "449"  "471"  "416"  "374a" "340a"
 6 "In-Home Therap… "693a" "656a" "762a" "731a" "719a" "707a" "716a" "657a"
 7 "Inpatient or R… "509a" "542a" "629"  "619"  "596"  "581a" "539a" "524a"
 8 "Hospital"       "422a" "467a" "515"  "529"  "516"  "511a" "469a" "440a"
 9 "Residential Tr… "224a" "233a" "299"  "229a" "225a" "199a" "198a" "213a"
10 "NONSPECIALTY M… "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "3,43…
11 "Education2,3"   "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "2,93…
12 "School Social … "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "2,28…
13 "Special School… "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "975a"
14 "General Medici… ""     ""     ""     ""     ""     ""     ""     ""    
15 "Pediatrician o… "657a" "732"  "840"  "810"  "694"  "692"  "710"  "605a"
16 "Juvenile Justi… ""     ""     ""     ""     ""     ""     ""     ""    
17 "Juvenile Deten… "--"   "--"   "--"   "--"   "--"   "--"   "--"   "109a"
18 "Child Welfare"  ""     ""     ""     ""     ""     ""     ""     ""    
19 "Foster Care or… "157a" "179a" "158a" "143a" "129"  "114"  "118"  "92"  
20 "SPECIALTY MENT… "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "nc"   "1,22…
# … with 9 more variables: `2010` <chr>, `2011` <chr>, `2012` <chr>,
#   `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
#   `2018` <chr>

Great! We can see the the legend is no longer part of the data.

Now let’s use the legend to recode the data. There are many different values for missing data, that we would like to replace with NA instead. We can use the pull() function of the dplyr package to take a look at the legend data:

[1] "* = low precision; -- = not available; da = does not apply; nc = not comparable due to methodological changes; nr = not reported due to measurement issues.\r\nNOTE: Some 2006 to 2010 estimates may differ from previously published estimates due to updates (see Section 3.3.5 in Chapter 3 of the 2018 National Survey on Drug Use and Health: Methodological Summary and Definitions).\r\nNOTE: Mental health services for persons aged 12 to 17 includes treatment/counseling for emotional or behavioral problems not caused by drug or alcohol use. Respondents with unknown mental health service information were excluded.\r\nNOTE: Respondents could indicate multiple service settings; thus, the response categories are not mutually exclusive.a The difference between this estimate and the 2018 estimate is statistically significant at the .05 level. Rounding may make the estimates appear identical.1 Because of revisions in 2013 to Specialty Mental Health Service estimates, these 2002 to 2012 estimates may differ from estimates published prior to the 2013 NSDUH.2 Because of revisions in 2009 to the questions on the Source of Youth Mental Health Education Services, these estimates are not comparable with the education services estimates published prior to the 2009 NSDUH.3 Respondents who did not report their school enrollment status, who reported not being enrolled in school in the past 12 months, or who reported being home-schooled were not asked about receipt of mental health services from this setting; however, respondents who reported not being enrolled in school in the past 12 months were classified as not having received treatment/counseling from this setting.\r\nDefinitions: Measures and terms are defined in Appendix A.\r\nSource: SAMHSA, Center for Behavioral Health Statistics and Quality, National Survey on Drug Use and Health, 2002-2018."

Looks like we want to replace values that are: *, --, da, nc, and nr. We can use the na_if() function to recode these values to NA.

avocado… there isn’t support for doing this in one command… but could at least do two commands

# A tibble: 20 x 18
   `Setting Where … `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
   <chr>            <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 "SPECIALTY MENT… 2,898a 3,065a 3,348a 3,362a 3,255a 3,104a 3,129a 2,925a
 2 "Outpatient"     2,662a 2,795a 3,015a 3,048a 2,931a 2,787a 2,837a 2,650a
 3 "Private Therap… 2,254a 2,347a 2,523a 2,573a 2,416a 2,365a 2,408a 2,296a
 4 "Mental Health … 611a   635a   716a   657a   587a   583a   567a   537a  
 5 "Partial Day Ho… 440    425    439    449    471    416    374a   340a  
 6 "In-Home Therap… 693a   656a   762a   731a   719a   707a   716a   657a  
 7 "Inpatient or R… 509a   542a   629    619    596    581a   539a   524a  
 8 "Hospital"       422a   467a   515    529    516    511a   469a   440a  
 9 "Residential Tr… 224a   233a   299    229a   225a   199a   198a   213a  
10 "NONSPECIALTY M… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   3,430a
11 "Education2,3"   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2,931a
12 "School Social … <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2,286 
13 "Special School… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   975a  
14 "General Medici… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
15 "Pediatrician o… 657a   732    840    810    694    692    710    605a  
16 "Juvenile Justi… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
17 "Juvenile Deten… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   109a  
18 "Child Welfare"  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
19 "Foster Care or… 157a   179a   158a   143a   129    114    118    92    
20 "SPECIALTY MENT… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   1,226a
# … with 9 more variables: `2010` <chr>, `2011` <chr>, `2012` <chr>,
#   `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
#   `2018` <chr>

Now let’s rename the first column using the rename() function of the dplyr package. This requires listing the new name first like so: new_name = old_name.

# A tibble: 6 x 18
  MHS_setting `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
  <chr>       <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 "SPECIALTY… 2,898a 3,065a 3,348a 3,362a 3,255a 3,104a 3,129a 2,925a 2,920a
2 "Outpatien… 2,662a 2,795a 3,015a 3,048a 2,931a 2,787a 2,837a 2,650a 2,635a
3 "Private T… 2,254a 2,347a 2,523a 2,573a 2,416a 2,365a 2,408a 2,296a 2,265a
4 "Mental He… 611a   635a   716a   657a   587a   583a   567a   537a   547a  
5 "Partial D… 440    425    439    449    471    416    374a   340a   362a  
6 "In-Home T… 693a   656a   762a   731a   719a   707a   716a   657a   674a  
# … with 8 more variables: `2011` <chr>, `2012` <chr>, `2013` <chr>,
#   `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>, `2018` <chr>

Nice!

Now you may notice that the individual values have an "a" after the numeric value.

According to the legend this indicates if “the difference between this estimate and the 2018 estimate is significant at the .05 level.”

While this is useful information, it makes it difficult to work with our numeric values, so we want to remove them.

Since lower case “a” values occur in the first column values, we want to makes sure we don’t remove these.

So how can we do this? We can use the stringr package to modify character strings. and we can use the dplyr functions mutate(), select() and across() to specify want columns we want to change.

Currently all of our data is of class character as indicated by the <chr> under the column names.

Click here for an explanation of what a character string is

There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3". If data is of class character, than the numeric values will not be processed like a numeric value in a mathematical sense. If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).

The stringr package has functions that allow us to replace (the str_replace() function) or remove(the str_remove() function) characters.

To use these we need to be able to specify what we want to remove and replace.

Here is a part of a cheatsheet about string manipulation from RStudio.

We can see that we can refer to any digit (such as 1,2,3 etc.) as [:digit:]. We can also see that we can refer to any punctuation mark as [:punct:]. Finally, we see that spaces and tabs can be referred to as [:blank:].

If we take a closer look at the first column of our table (using the pull() function of the dplyr package), we can see that besides the "a" values that we see adjacent to our numeric values in the body of the table, we also some large white spaces, some numeric values, instances of \r\n, as well as some commas and other punctuation marks.

 [1] "SPECIALTY MENTAL HEALTH SERVICE1"                                                                                
 [2] "Outpatient"                                                                                                      
 [3] "Private Therapist, Psychologist,\r\n   Psychiatrist, Social Worker, or\r\n   Counselor"                          
 [4] "Mental Health Clinic or Center"                                                                                  
 [5] "Partial Day Hospital or Day Treatment\r\n   Program"                                                             
 [6] "In-Home Therapist, Counselor, or Family\r\n   Preservation Worker"                                               
 [7] "Inpatient or Residential1"                                                                                       
 [8] "Hospital"                                                                                                        
 [9] "Residential Treatment Center"                                                                                    
[10] "NONSPECIALTY MENTAL HEALTH\r\nSERVICE2"                                                                          
[11] "Education2,3"                                                                                                    
[12] "School Social Worker, School Psychologist,\r\n   or School Counselor"                                            
[13] "Special School or Program within a Regular\r\n   School for Students with Emotional or\r\n   Behavioral Problems"
[14] "General Medicine"                                                                                                
[15] "Pediatrician or Other Family Doctor"                                                                             
[16] "Juvenile Justice"                                                                                                
[17] "Juvenile Detention Center, Prison, or Jail"                                                                      
[18] "Child Welfare"                                                                                                   
[19] "Foster Care or Therapeutic Foster Care"                                                                          
[20] "SPECIALTY MENTAL HEALTH SERVICES\r\nAND EDUCATION, GENERAL MEDICINE\r\nOR CHILD WELFARE SERVICES1,2,3"           

We can use the str_remove_all() function which is a variant of the str_remove() function of the stringr package (which allows us to remove all occurrences of specified characters in each row rather than just the first occurrence, which is what str_remove() does), to remove the digit values, the \r\n characters and the punctuation marks from the column called MHS_setting.

Using the mutate() function we specify that we want to change this particular column and replace it with a version of this column that has removed characters that match digits, r\n or punctuation marks.

We need to specify that the character strings that should be used can be found in the MHS_setting column by using the string = argument and the patterns to find and remove are specified using the pattern = argument.

To allow us to look for all three of these patterns at the same time, we can use the | symbol between each pattern.

# A tibble: 20 x 18
   MHS_setting `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
   <chr>       <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 SPECIALTY … 2,898a 3,065a 3,348a 3,362a 3,255a 3,104a 3,129a 2,925a 2,920a
 2 Outpatient  2,662a 2,795a 3,015a 3,048a 2,931a 2,787a 2,837a 2,650a 2,635a
 3 Private Th… 2,254a 2,347a 2,523a 2,573a 2,416a 2,365a 2,408a 2,296a 2,265a
 4 Mental Hea… 611a   635a   716a   657a   587a   583a   567a   537a   547a  
 5 Partial Da… 440    425    439    449    471    416    374a   340a   362a  
 6 InHome The… 693a   656a   762a   731a   719a   707a   716a   657a   674a  
 7 Inpatient … 509a   542a   629    619    596    581a   539a   524a   531a  
 8 Hospital    422a   467a   515    529    516    511a   469a   440a   447a  
 9 Residentia… 224a   233a   299    229a   225a   199a   198a   213a   217a  
10 NONSPECIAL… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   3,430a 3,465a
11 Education   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2,931a 2,957a
12 School Soc… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2,286  2,214 
13 Special Sc… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   975a   1,054a
14 General Me… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
15 Pediatrici… 657a   732    840    810    694    692    710    605a   601a  
16 Juvenile J… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
17 Juvenile D… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   109a   80a   
18 Child Welf… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
19 Foster Car… 157a   179a   158a   143a   129    114    118    92     108   
20 SPECIALTY … <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   1,226a 1,283a
# … with 8 more variables: `2011` <chr>, `2012` <chr>, `2013` <chr>,
#   `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>, `2018` <chr>

We also want to replace the spaces with a single space. We can see that sometimes there appears to be more than one space. We can specify that we want any occurrence of 1 or more to be replaced by using the {1,} notation.

See here for an explanation of this on the cheat sheet:

So now we will use the str_replace_all() function of the stringr package. In this case we also need to specify a replacement with the replacement = argument.

# A tibble: 20 x 18
   MHS_setting `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
   <chr>       <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 SPECIALTY … 2,898a 3,065a 3,348a 3,362a 3,255a 3,104a 3,129a 2,925a 2,920a
 2 Outpatient  2,662a 2,795a 3,015a 3,048a 2,931a 2,787a 2,837a 2,650a 2,635a
 3 Private Th… 2,254a 2,347a 2,523a 2,573a 2,416a 2,365a 2,408a 2,296a 2,265a
 4 Mental Hea… 611a   635a   716a   657a   587a   583a   567a   537a   547a  
 5 Partial Da… 440    425    439    449    471    416    374a   340a   362a  
 6 InHome The… 693a   656a   762a   731a   719a   707a   716a   657a   674a  
 7 Inpatient … 509a   542a   629    619    596    581a   539a   524a   531a  
 8 Hospital    422a   467a   515    529    516    511a   469a   440a   447a  
 9 Residentia… 224a   233a   299    229a   225a   199a   198a   213a   217a  
10 NONSPECIAL… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   3,430a 3,465a
11 Education   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2,931a 2,957a
12 School Soc… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2,286  2,214 
13 Special Sc… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   975a   1,054a
14 General Me… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
15 Pediatrici… 657a   732    840    810    694    692    710    605a   601a  
16 Juvenile J… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
17 Juvenile D… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   109a   80a   
18 Child Welf… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
19 Foster Car… 157a   179a   158a   143a   129    114    118    92     108   
20 SPECIALTY … <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   1,226a 1,283a
# … with 8 more variables: `2011` <chr>, `2012` <chr>, `2013` <chr>,
#   `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>, `2018` <chr>

Now to finally remove the “a” values and the commas from the body of the table we can use str_remove_all() function yet again. However this time to specify that we want all columns except the first column called MHS_setting, we can use the across() function of the dplyr package. This allows us to specify what columns we want to mutate by using the .cols = argument. We can select all columns except the first column called MHS_setting by using a minus sign - in front of the column name.

# A tibble: 20 x 18
   MHS_setting `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
   <chr>       <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 SPECIALTY … 2898   3065   3348   3362   3255   3104   3129   2925   2920  
 2 Outpatient  2662   2795   3015   3048   2931   2787   2837   2650   2635  
 3 Private Th… 2254   2347   2523   2573   2416   2365   2408   2296   2265  
 4 Mental Hea… 611    635    716    657    587    583    567    537    547   
 5 Partial Da… 440    425    439    449    471    416    374    340    362   
 6 InHome The… 693    656    762    731    719    707    716    657    674   
 7 Inpatient … 509    542    629    619    596    581    539    524    531   
 8 Hospital    422    467    515    529    516    511    469    440    447   
 9 Residentia… 224    233    299    229    225    199    198    213    217   
10 NONSPECIAL… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   3430   3465  
11 Education   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2931   2957  
12 School Soc… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   2286   2214  
13 Special Sc… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   975    1054  
14 General Me… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
15 Pediatrici… 657    732    840    810    694    692    710    605    601   
16 Juvenile J… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
17 Juvenile D… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   109    80    
18 Child Welf… <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
19 Foster Car… 157    179    158    143    129    114    118    92     108   
20 SPECIALTY … <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   1226   1283  
# … with 8 more variables: `2011` <chr>, `2012` <chr>, `2013` <chr>,
#   `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>, `2018` <chr>

Our table is looking much better!

We also want to change our values to be numeric as opposed to character so that we can use them in mathematical functions. We can use the base as.numeric() function. Again we will use the across() function to indicate what variables we wish to mutate.

# A tibble: 20 x 18
   MHS_setting `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 SPECIALTY …   2898   3065   3348   3362   3255   3104   3129   2925   2920
 2 Outpatient    2662   2795   3015   3048   2931   2787   2837   2650   2635
 3 Private Th…   2254   2347   2523   2573   2416   2365   2408   2296   2265
 4 Mental Hea…    611    635    716    657    587    583    567    537    547
 5 Partial Da…    440    425    439    449    471    416    374    340    362
 6 InHome The…    693    656    762    731    719    707    716    657    674
 7 Inpatient …    509    542    629    619    596    581    539    524    531
 8 Hospital       422    467    515    529    516    511    469    440    447
 9 Residentia…    224    233    299    229    225    199    198    213    217
10 NONSPECIAL…     NA     NA     NA     NA     NA     NA     NA   3430   3465
11 Education       NA     NA     NA     NA     NA     NA     NA   2931   2957
12 School Soc…     NA     NA     NA     NA     NA     NA     NA   2286   2214
13 Special Sc…     NA     NA     NA     NA     NA     NA     NA    975   1054
14 General Me…     NA     NA     NA     NA     NA     NA     NA     NA     NA
15 Pediatrici…    657    732    840    810    694    692    710    605    601
16 Juvenile J…     NA     NA     NA     NA     NA     NA     NA     NA     NA
17 Juvenile D…     NA     NA     NA     NA     NA     NA     NA    109     80
18 Child Welf…     NA     NA     NA     NA     NA     NA     NA     NA     NA
19 Foster Car…    157    179    158    143    129    114    118     92    108
20 SPECIALTY …     NA     NA     NA     NA     NA     NA     NA   1226   1283
# … with 8 more variables: `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
#   `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>

We would also like to add a type and subtype variable, that specifies the general categories of settings where services were received, as well as remove a couple of rows that are completely empty. These are the rows where the first column values are Genearl_Medicine and Juvenile_Justice, and Child Welfare. If we look at the website, we can see that these were leading lines with no data.

First we will add the type and subtype variables using the mutate function.

We also want to add a variable with shorter names for labels in plots and statistical output.

We can remove the empty rows using the filter() function of the dplyr package. We can specify that we don’t want to keep these rows by using the != not equal to operator.

# A tibble: 20 x 21
   MHS_setting `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 SPECIALTY …   2898   3065   3348   3362   3255   3104   3129   2925   2920
 2 Outpatient    2662   2795   3015   3048   2931   2787   2837   2650   2635
 3 Private Th…   2254   2347   2523   2573   2416   2365   2408   2296   2265
 4 Mental Hea…    611    635    716    657    587    583    567    537    547
 5 Partial Da…    440    425    439    449    471    416    374    340    362
 6 InHome The…    693    656    762    731    719    707    716    657    674
 7 Inpatient …    509    542    629    619    596    581    539    524    531
 8 Hospital       422    467    515    529    516    511    469    440    447
 9 Residentia…    224    233    299    229    225    199    198    213    217
10 NONSPECIAL…     NA     NA     NA     NA     NA     NA     NA   3430   3465
11 Education       NA     NA     NA     NA     NA     NA     NA   2931   2957
12 School Soc…     NA     NA     NA     NA     NA     NA     NA   2286   2214
13 Special Sc…     NA     NA     NA     NA     NA     NA     NA    975   1054
14 General Me…     NA     NA     NA     NA     NA     NA     NA     NA     NA
15 Pediatrici…    657    732    840    810    694    692    710    605    601
16 Juvenile J…     NA     NA     NA     NA     NA     NA     NA     NA     NA
17 Juvenile D…     NA     NA     NA     NA     NA     NA     NA    109     80
18 Child Welf…     NA     NA     NA     NA     NA     NA     NA     NA     NA
19 Foster Car…    157    179    158    143    129    114    118     92    108
20 SPECIALTY …     NA     NA     NA     NA     NA     NA     NA   1226   1283
# … with 11 more variables: `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
#   `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
#   type <chr>, subtype <chr>, short_label <chr>

Finally, we would like to change the shape of our table so that we have a new column that represents the year and a new column that represents the value for that year. To do so we will be making our table “longer”, meaning that it will have fewer rows and more columns. See here for more information about different table formats, typically referred to as wide and long or sometimes narrow.

We will use the pivot_longer() function of the tidyr package to change the shape of our table.

There are 3 main arguments in this function:
1) cols - which specifies what columns to collapse
2) names_to - which specifies the name of the new column that will be created that will contain the column names of the columns you are collapsing
3) values_to - which specifies the name of the new column that will be created that will contain the values from the columns you are collapsing

To specify that we want to collapse all columns except the MHS_setting column we can again use the minus sign. Finally, we will make the Year variable numeric as well.

# A tibble: 340 x 6
   MHS_setting                type     subtype       short_label     Year Number
   <chr>                      <chr>    <chr>         <chr>          <dbl>  <dbl>
 1 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2002   2898
 2 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2003   3065
 3 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2004   3348
 4 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2005   3362
 5 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2006   3255
 6 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2007   3104
 7 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2008   3129
 8 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2009   2925
 9 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2010   2920
10 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty tot…  2011   3101
# … with 330 more rows

We can see that our table is now much longer - as we have 340 rows!

Question Opportunity

Why do we have 340 rows now?

Now we see that the Year and Number variables are of class double because of the <dbl> under the column name.

Let’s take a look at what the rest of the tables contain:

Table Details
Table 11.1A Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018
Table 11.1B Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018
Table 11.2A Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.2B Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018
Table 11.3A Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018
Table 11.3B Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018
Table 11.4A Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.4B Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018

OK, so the next table is very similar to Table11.1A, while the remaining tables have information about demographics.

As a reminder here are all of the steps that we performed to wrangle table11.1a:

table11.1a %<>%
  dplyr::as_tibble() %>%
  slice(1:(n()-1))%>%
  dplyr::na_if("nc") %>%
  dplyr::na_if("--") %>%
  dplyr::na_if("") %>%
  dplyr::na_if("*")%>%
  dplyr::rename(MHS_setting = 
                  `Setting Where Mental Health ServiceWas Received`) %>%
  mutate(MHS_setting = 
         str_remove_all(string = MHS_setting, 
                       pattern = "[:digit:]|\r\n|[:punct:]|")) %>%
  mutate(MHS_setting = 
         str_replace_all(string = MHS_setting,
                        pattern = "[:blank:]{1,}", 
                    replacement = " ")) %>%
  mutate(dplyr::across(.cols = -MHS_setting,
                stringr::str_remove_all, "a|,")) %>%
  mutate(across(-MHS_setting, as.numeric)) %>%
  mutate(type = c(rep("Specialty", 9), rep("Nonspecialty", 11))) %>%
  mutate(subtype =c("Specialty_total", 
                    rep("Outpatient", 5), 
                    rep("Inpatient", 3), 
                    "Nonspecialty_total", 
                    rep("Education", 3), 
                    rep("General_medicine", 2),
                    rep("Juvenile_Justice", 2),
                    rep("Child_Welfare", 2), 
                    "combination")) %>%
  mutate(short_label = c("Specialty total", "Outpatient total", "Therapist", "Clinic", "Day program", "In-home Therapist", "Inpatient total", "Hospital", "Residential Center", "Nonspecialty total", "School total", "School Therapist", "School Program", "General Medicine", "Family Dr", "Justice System", "Justice System", "Welfare", "Fostercare", "Specialty Combination")) %>%
  dplyr::filter(MHS_setting != "General_Medicine") %>%
  dplyr::filter(MHS_setting != "Juvenile_Justice") %>%
  dplyr::filter(MHS_setting != "Child_Welfare") %>%
    tidyr::pivot_longer(cols = contains("20"), 
                  names_to = "Year",
                 values_to = "Number") %>%
   mutate(Year = as.numeric(Year))

Now we want to wrangle table11.1B which is formatted the most similarly. To do so we can simply run these steps on the using the table11.1B as the input. For the sake of education however, we will show you how you could make a function if we had several more similar tables to wrangle. This will also make it easier to write a function to wrangle the other demographic tables.

Last time we wrote a function in this case study, we only had one input in our function. This time we will have several inputs. We will have the table that we want to wrangle as TABLE, a new name for the first column called new_col, and an input called pivot_col which will be the name of the column that will be created after pivoting that will take the values from each of the years.

We will also add code to remove all rows that have only NA values, this means we don’t need to know what rows ahead of time.

To do this we will use the filter() and select() functions of the dplyr package.

We will calculate a sum of the count of NA values across the rows for the numeric columns (the columns for each year) using the base rowSums() function.

To do this we first select the columns that are numeric using: select(., is.numeric), where the . refers to the table after all the previous wrangling steps in our function.

Then we get a true or false statement about which columns have NA values with the base is.na() function (this requires numeric values).

Then we calculate the sum using the base rowSums() function.

Altogether this looks like this: rowSums(is.na(select(., is.numeric))). Finally we compare this to the number of columns that are numeric by using: length(select(., is.numeric))), with the idea that if the number of NA values is less than the number of columns that could have NA values, then we know it is not an empty row.

Note that if we were using the summarise() or mutate() function or the dplyr package, then we could use the across() function of the dplyr package to select what columns we wanted to use in our calculation.

Now we can apply the function to table11.1b.

Table11.1b

# A tibble: 289 x 6
   MHS_setting                type     subtype       short_label    Year Percent
   <chr>                      <chr>    <chr>         <chr>         <dbl>   <dbl>
 1 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2002    11.8
 2 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2003    12.4
 3 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2004    13.4
 4 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2005    13.4
 5 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2006    13  
 6 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2007    12.4
 7 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2008    12.7
 8 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2009    12  
 9 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2010    12.1
10 SPECIALTY MENTAL HEALTH S… Special… Specialty_to… Specialty to…  2011    12.6
# … with 279 more rows

Great!

What about the subsequent tables?

Demographic Tables

All of the rest of the tables have demographic information and have this general structure:

In these tables we have age groups in our first column so we don’t want to remove digits or punctuation marks anymore so we need to modify our function a bit to remove that step.

We also want to add the word Age and an underscore in front of the age group listed in the tables. We can use the str_replace() function of the stringr package, because now we want to only replace the first instance of 1 with Age_1.

We also plan to replace the first column name with Demographic for all of the tables.

We also want to create a new variable that list the subgroups.

We will also want to only wrangle the data up to the point that we change the shape of the data, so that we can check how the data looks first.

from Michael

We use the function to wrangle the next set of tables. We will also add a column to describe what the data is about which will be helpful for merging the data later.

Rows: 14
Columns: 18
$ Demographic <chr> "TOTAL", "Age_12-13", "Age_14-15", "Age_16-17", "Male", "…
$ `2004`      <dbl> 2225, 445, 783, 997, 637, 1588, 1848, 1413, 287, 13, NA, …
$ `2005`      <dbl> 2191, 417, 811, 964, 571, 1620, 1802, 1390, 288, 10, NA, …
$ `2006`      <dbl> 1970, 383, 684, 902, 539, 1431, 1614, 1220, 245, 12, NA, …
$ `2007`      <dbl> 2016, 337, 705, 974, 586, 1430, 1691, 1284, 292, 7, NA, 6…
$ `2008`      <dbl> 2027, 366, 706, 955, 540, 1487, 1681, 1266, 261, 13, NA, …
$ `2009`      <dbl> 1954, 330, 741, 883, 577, 1377, 1594, 1180, 285, 9, NA, 7…
$ `2010`      <dbl> 1911, 330, 706, 876, 536, 1375, 1537, 1186, 236, 10, 1, 5…
$ `2011`      <dbl> 2011, 312, 710, 989, 566, 1446, 1586, 1167, 238, 15, NA, …
$ `2012`      <dbl> 2213, 420, 844, 950, 581, 1632, 1648, 1231, 269, 7, NA, 4…
$ `2013`      <dbl> 2587, 470, 1025, 1091, 657, 1930, 1970, 1449, 289, 7, NA,…
$ `2014`      <dbl> 2751, 548, 986, 1217, 710, 2042, 2123, 1578, 306, 11, NA,…
$ `2015`      <dbl> 3031, 587, 1163, 1281, 725, 2306, 2323, 1742, 302, NA, NA…
$ `2016`      <dbl> 3089, 548, 1115, 1427, 786, 2303, 2366, 1781, 301, 16, NA…
$ `2017`      <dbl> 3214, 516, 1200, 1498, 845, 2369, 2419, 1785, 313, 24, NA…
$ `2018`      <dbl> 3482, 632, 1250, 1600, 946, 2537, 2601, 1905, 334, 23, NA…
$ subgroup    <chr> "Total", "Age", "Age", "Age", "Sex", "Sex", "Race/Ethnici…
$ data_type   <chr> "Major_Depressive_Episode", "Major_Depressive_Episode", "…
Rows: 14
Columns: 18
$ Demographic <chr> "TOTAL", "Age_12-13", "Age_14-15", "Age_16-17", "Male", "…
$ `2004`      <dbl> 9.0, 5.4, 9.2, 12.3, 5.0, 13.1, 8.9, 9.2, 7.7, 7.8, NA, 8…
$ `2005`      <dbl> 8.8, 5.2, 9.5, 11.5, 4.5, 13.3, 8.7, 9.1, 7.6, 6.1, NA, 6…
$ `2006`      <dbl> 7.9, 4.9, 7.9, 10.7, 4.2, 11.8, 7.9, 8.2, 6.4, 9.3, NA, 7…
$ `2007`      <dbl> 8.2, 4.3, 8.4, 11.5, 4.6, 11.9, 8.4, 8.7, 7.8, 4.6, NA, 6…
$ `2008`      <dbl> 8.3, 4.9, 8.5, 11.2, 4.3, 12.5, 8.5, 8.8, 7.1, 10.1, NA, …
$ `2009`      <dbl> 8.1, 4.6, 8.8, 10.4, 4.7, 11.7, 8.2, 8.4, 7.9, 7.5, NA, 7…
$ `2010`      <dbl> 8.0, 4.3, 9.0, 10.6, 4.4, 11.9, 8.1, 8.6, 6.8, 7.4, 1.8, …
$ `2011`      <dbl> 8.2, 4.1, 8.6, 11.7, 4.5, 12.1, 8.3, 8.6, 7.0, 11.4, NA, …
$ `2012`      <dbl> 9.1, 5.4, 10.2, 11.4, 4.7, 13.7, 8.7, 9.1, 7.9, 5.2, NA, …
$ `2013`      <dbl> 10.7, 6.1, 12.4, 13.2, 5.3, 16.2, 10.4, 10.9, 8.6, 4.5, N…
$ `2014`      <dbl> 11.4, 7.2, 11.9, 14.6, 5.7, 17.3, 11.3, 12.0, 9.1, 6.9, N…
$ `2015`      <dbl> 12.5, 7.8, 13.8, 15.5, 5.8, 19.5, 12.5, 13.4, 9.0, NA, NA…
$ `2016`      <dbl> 12.8, 7.3, 13.3, 17.2, 6.4, 19.4, 12.8, 13.8, 9.1, 11.5, …
$ `2017`      <dbl> 13.3, 6.9, 14.5, 17.7, 6.8, 20.0, 13.1, 14.0, 9.5, 16.3, …
$ `2018`      <dbl> 14.4, 8.4, 15.3, 19.0, 7.7, 21.5, 14.2, 15.1, 10.3, 15.2,…
$ subgroup    <chr> "Total", "Age", "Age", "Age", "Sex", "Sex", "Race/Ethnici…
$ data_type   <chr> "Major_Depressive_Episode", "Major_Depressive_Episode", "…
Rows: 13
Columns: 16
$ Demographic <chr> "TOTAL", "Age_12-13", "Age_14-15", "Age_16-17", "Male", "…
$ `2006`      <dbl> 1358, 211, 518, 629, 335, 1023, 1118, 871, 150, 9, 54, 32…
$ `2007`      <dbl> 1371, 200, 500, 671, 386, 986, 1141, 873, 193, 4, 39, 32,…
$ `2008`      <dbl> 1460, 239, 505, 716, 359, 1101, 1226, 944, 171, 8, 44, 50…
$ `2009`      <dbl> 1404, 235, 521, 648, 391, 1013, 1150, 858, 204, 5, 48, 31…
$ `2010`      <dbl> 1350, 232, 479, 639, 395, 954, 1093, 853, 157, 7, 44, 30,…
$ `2011`      <dbl> 1388, 218, 487, 683, 397, 991, 1113, 799, 183, 13, 60, 57…
$ `2012`      <dbl> 1544, 285, 590, 669, 373, 1172, 1152, 883, 164, 4, 30, 64…
$ `2013`      <dbl> 1868, 314, 752, 801, 435, 1432, 1425, 1046, 207, 6, 98, 6…
$ `2014`      <dbl> 1990, 375, 707, 909, 461, 1529, 1540, 1167, 214, 8, 80, 6…
$ `2015`      <dbl> 2129, 388, 826, 915, 477, 1652, 1651, 1258, 198, NA, 67, …
$ `2016`      <dbl> 2168, 354, 789, 1025, 539, 1629, 1703, 1290, 196, 8, 118,…
$ `2017`      <dbl> 2265, 332, 861, 1072, 581, 1684, 1694, 1256, 233, 6, 101,…
$ `2018`      <dbl> 2423, 382, 891, 1150, 628, 1795, 1841, 1354, 225, NA, 127…
$ subgroup    <chr> "Total", "Age", "Age", "Age", "Sex", "Sex", "Race/Ethnici…
$ data_type   <chr> "Severe_Major_Depressive_Episode", "Severe_Major_Depressi…
Rows: 13
Columns: 16
$ Demographic <chr> "TOTAL", "Age_12-13", "Age_14-15", "Age_16-17", "Male", "…
$ `2006`      <dbl> 5.5, 2.7, 6.0, 7.5, 2.6, 8.4, 5.5, 5.8, 3.9, 6.6, 5.3, 8.…
$ `2007`      <dbl> 5.5, 2.5, 6.0, 7.9, 3.0, 8.2, 5.7, 5.9, 5.1, 2.6, 3.9, 7.…
$ `2008`      <dbl> 6.0, 3.2, 6.1, 8.4, 2.9, 9.3, 6.2, 6.5, 4.6, 6.5, 4.7, 10…
$ `2009`      <dbl> 5.8, 3.2, 6.2, 7.7, 3.2, 8.6, 5.9, 6.1, 5.7, 4.3, 5.0, 6.…
$ `2010`      <dbl> 5.7, 3.0, 6.1, 7.7, 3.2, 8.2, 5.7, 6.2, 4.5, 5.4, 4.3, 5.…
$ `2011`      <dbl> 5.7, 2.8, 5.9, 8.1, 3.2, 8.3, 5.8, 5.9, 5.4, 9.8, 5.0, 8.…
$ `2012`      <dbl> 6.3, 3.7, 7.1, 8.0, 3.0, 9.8, 6.1, 6.5, 4.8, 2.6, 2.6, 9.…
$ `2013`      <dbl> 7.7, 4.1, 9.1, 9.7, 3.5, 12.0, 7.6, 7.8, 6.2, 3.8, 8.1, 8…
$ `2014`      <dbl> 8.2, 4.9, 8.5, 10.9, 3.7, 13.0, 8.2, 8.9, 6.4, 4.9, 6.6, …
$ `2015`      <dbl> 8.8, 5.1, 9.8, 11.1, 3.8, 14.0, 8.9, 9.7, 5.9, NA, 5.5, 1…
$ `2016`      <dbl> 9.0, 4.7, 9.4, 12.4, 4.4, 13.7, 9.2, 10.0, 6.0, 5.7, 9.3,…
$ `2017`      <dbl> 9.4, 4.4, 10.4, 12.7, 4.7, 14.2, 9.2, 9.8, 7.1, 3.9, 7.9,…
$ `2018`      <dbl> 10.0, 5.1, 10.9, 13.7, 5.1, 15.2, 10.1, 10.7, 6.9, NA, 9.…
$ subgroup    <chr> "Total", "Age", "Age", "Age", "Sex", "Sex", "Race/Ethnici…
$ data_type   <chr> "Severe_Major_Depressive_Episode", "Severe_Major_Depressi…
Rows: 11
Columns: 18
$ Demographic <chr> "TOTAL", "Age_12-13", "Age_14-15", "Age_16-17", "Male", "…
$ `2004`      <dbl> 895, 169, 278, 448, 239, 656, 756, 633, 82, NA, 139
$ `2005`      <dbl> 822, 136, 329, 357, 193, 629, 700, 544, 113, NA, 122
$ `2006`      <dbl> 760, 133, 263, 364, 189, 571, 634, 502, 70, NA, 126
$ `2007`      <dbl> 782, 137, 259, 386, 214, 568, 691, 545, 116, NA, 91
$ `2008`      <dbl> 764, 122, 236, 405, 183, 581, 658, 545, 85, NA, 105
$ `2009`      <dbl> 673, 98, 244, 331, 168, 505, 555, 444, 67, NA, 118
$ `2010`      <dbl> 721, 106, 271, 343, 171, 549, 577, 487, 54, NA, 144
$ `2011`      <dbl> 769, 112, 258, 400, 199, 570, 645, 482, 97, NA, 125
$ `2012`      <dbl> 813, 127, 307, 379, 163, 650, 642, 500, 90, NA, 171
$ `2013`      <dbl> 977, 181, 376, 420, 193, 784, 753, 598, 83, NA, 224
$ `2014`      <dbl> 1122, 194, 394, 535, 265, 857, 918, 723, 123, NA, 204
$ `2015`      <dbl> 1186, 185, 472, 530, 262, 924, 936, 702, 127, 55, 251
$ `2016`      <dbl> 1249, 189, 455, 605, 260, 989, 1008, 799, 102, NA, 241
$ `2017`      <dbl> 1330, 193, 453, 684, 274, 1056, 1072, 847, 110, 62, 258
$ `2018`      <dbl> 1432, 251, 514, 667, 351, 1081, 1100, 870, 115, 48, 332
$ subgroup    <chr> "Total", "Age", "Age", "Age", "Sex", "Sex", "Race/Ethnici…
$ data_type   <chr> "treatment", "treatment", "treatment", "treatment", "trea…
Rows: 11
Columns: 18
$ Demographic <chr> "TOTAL", "Age_12-13", "Age_14-15", "Age_16-17", "Male", "…
$ `2004`      <dbl> 40.3, 38.2, 35.5, 45.0, 37.7, 41.3, 41.0, 44.9, 28.9, NA,…
$ `2005`      <dbl> 37.8, 32.9, 41.1, 37.1, 34.1, 39.0, 39.0, 39.3, 39.3, NA,…
$ `2006`      <dbl> 38.8, 35.1, 38.4, 40.7, 35.3, 40.2, 39.5, 41.3, 29.1, NA,…
$ `2007`      <dbl> 39.0, 41.5, 36.8, 39.8, 36.7, 40.0, 41.1, 42.7, 39.7, NA,…
$ `2008`      <dbl> 37.7, 33.5, 33.6, 42.4, 34.0, 39.1, 39.3, 43.1, 32.4, NA,…
$ `2009`      <dbl> 34.6, 30.0, 33.2, 37.5, 29.2, 36.9, 35.0, 37.7, 23.9, NA,…
$ `2010`      <dbl> 37.8, 32.5, 38.4, 39.3, 32.0, 40.1, 37.6, 41.1, 23.0, NA,…
$ `2011`      <dbl> 38.4, 36.3, 36.3, 40.5, 35.3, 39.5, 40.7, 41.4, 41.0, NA,…
$ `2012`      <dbl> 37.0, 30.7, 36.6, 40.0, 28.3, 40.1, 39.0, 40.7, 33.5, NA,…
$ `2013`      <dbl> 38.1, 39.1, 37.2, 38.6, 29.7, 40.9, 38.5, 41.6, 28.6, NA,…
$ `2014`      <dbl> 41.2, 35.9, 40.1, 44.4, 37.7, 42.4, 43.5, 46.1, 40.6, NA,…
$ `2015`      <dbl> 39.3, 31.9, 40.6, 41.5, 36.3, 40.3, 40.5, 40.6, 42.0, 46.…
$ `2016`      <dbl> 40.9, 35.3, 41.3, 42.6, 33.5, 43.4, 42.9, 45.1, 34.5, NA,…
$ `2017`      <dbl> 41.5, 37.6, 37.9, 45.8, 32.5, 44.8, 44.4, 47.5, 35.1, 44.…
$ `2018`      <dbl> 41.4, 40.7, 41.2, 41.8, 37.5, 42.9, 42.6, 46.1, 34.6, 34.…
$ subgroup    <chr> "Total", "Age", "Age", "Age", "Sex", "Sex", "Race/Ethnici…
$ data_type   <chr> "treatment", "treatment", "treatment", "treatment", "trea…

Great! All of the demographic tables look good.

Now let’s combine the count data (the “a” tables) and the percent data (the “b” tables) using the bind_rows() function of the dplyr package. Which will append each of the subsequent tables together.

We can use the distinct() function of the dplyr package to check that we indeed have all the data types now in these larger tibbles.

# A tibble: 3 x 1
  data_type                      
  <chr>                          
1 Major_Depressive_Episode       
2 Severe_Major_Depressive_Episode
3 treatment                      
# A tibble: 3 x 1
  data_type                      
  <chr>                          
1 Major_Depressive_Episode       
2 Severe_Major_Depressive_Episode
3 treatment                      

Great!

Now we will reformat both the counts and percents data to be in the long format using pivot_longer() once again.

Now that we’ve wrangled the data, we can go ahead and proceed with our analysis.

Data Analysis


Recall what our main questions were:

Our main questions:

  1. How have depression rates in American youth changed since 2004, according to the NSDUH data? How have rates differed between different youth subgroups (age, gender, ethnicity)?
  2. Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?

We are specifically interested in how the frequency of recent major depressive episodes among youths differ compared to those in 2004. We are also interested how different groups differ. We will start with examining how male and females differ across time.

Since we have counts for the two sexes: males and females, and for the two years of interest, 2004 and 2018, we can conduct a Pearson’s chi-squared test for independence, which is also written like this: \({\chi}^2\).

This will allow us to compare if the frequencies of major depressive episodes differs from what we would expect by chance if the variables years and sex were independent.

The null hypothesis is that the variables are independent or that the difference in the proportions is equal to zero. Thus we want to test if sex and year are independent. In other words, we want to know if sex influences the number of major depressive episodes in 2004 and 2018. Or stated yet another way, Are major depressive episode counts across the two years associated with sex.

According to wikipedia:

Pearson’s chi-square test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

Thus, to conduct this statistical test, we need to produce a contingency table, which in this case could also be called a 2x2 table, which is the simplest kind of contingency table (only two levels for two variables).

Here is an example of another 2x2 table:

# A tibble: 3 x 4
  Sex    Righthanded Lefthanded Total
  <chr>        <dbl>      <dbl> <dbl>
1 Male            41          9    50
2 Female          47          3    50
3 Total           88         12   100

Here are all the observed values.

We can see that there are two variables: Sex and Handedness and each have two levels (male and female for sex and right and left for handedness). The table also includes totals of each of the 4 possible groups as well as the overall total.

Now let’s create a table of expected values assuming sex and handedness are independent.

The total right-handed is 88 out of 100 which is equal to 88%. The total left-handed is 12 out of 100 which is equal to 12%.

Thus if each sex had the same amount of right-handedness and left-handedness, we would expect 88% of the males to be right-handed and 12 % to be left-handed; and we would expect the exact same proportions of 88% right-handed and 12% left-handed for the females.

Since we have 50 males and 50 females and 12% of 50 is 6 and 88% of 50 is 44, we would thus have the following table of expected frequencies:

# A tibble: 3 x 4
  Sex    Righthanded Lefthanded Total
  <chr>        <dbl>      <dbl> <dbl>
1 Male            44          6    50
2 Female          44          6    50
3 Total           88         12   100

The \({\chi}^2\) is calculated like so:

\[{\chi}^2=\sum_{j=1}^{m} \frac{(O_j - E_j)^2}{E_j}\]

Where \(O_j = j^{th}\) observed count and \(E_j = j^{th}\) expected count for the jth cell of a contingency table with \(m\) cells.

The degrees of freedom (\(df\)) is calculated like so: \(df= (r-1)(c-1)\).

Where \(r\) = # of rows and \(c\) = the # of columns.

So to calculate the \({\chi}^2\) for handedness and sex manually, we can do the following for each of the four values in the table (besides the totals) like this:

\({\chi}^2\) = fraction of the square difference of the observed minus the expected, divided by expected for right-handed males + the same for left-handed males + the same for right-handed females + the same for left-handed females

This is equal to :

\[{\chi}^2 = \frac{(41-44)^2}{44} + \frac{(9-6)^2}{6}+ \frac{(47-44)^2}{44}+ \frac{(3-6)^2}{6}\]

Which is equal to:

\({\chi}^2\) = 3.4090909

Where the degrees of freedom = \(df = 92-1)(2-1) = 1\)

What does this mean? We need to consult a chi-square distribution to determine what the p-value is and if this is significant.

#### [source

This website has a simple way to check this, without requiring you to get out a ruler and consult this plot. Note on this website the notation for \(df\) is \({\nu}\).

Plugging in 3.409091 as \({\chi}^2\) and 1 as \({\nu}\), we get a \(p-value\) of 0.06484.

Thus we do not have enough evidence to reject the null hypothesis. Therefore, we conclude that sex and handedness do appear to be independent.

See here for a more thorough explanation of the chi-square test.

Now let’s create a contingency table with our own data and see how we can implement this test in R.

It is critical that we use the counts data, and not the percentage data for our analysis, as the \({\chi}^2\) requires counts.

We will filter the count data for the Major_Depressive_Episode data, as well as for the Male and Female data from 2004 and 2018.

The following code subsets the data we need and makes the necessary manipulations so that the units of observation are appropriate. If we take a look at the title of the table we can see that the numbers in the table represent individuals by the thousands.

The resulting object contains all the values we need for out contingency table.

# A tibble: 4 x 5
  Demographic subgroup data_type                 Year  Number
  <chr>       <chr>    <chr>                    <dbl>   <dbl>
1 Male        Sex      Major_Depressive_Episode  2004  637000
2 Male        Sex      Major_Depressive_Episode  2018  946000
3 Female      Sex      Major_Depressive_Episode  2004 1588000
4 Female      Sex      Major_Depressive_Episode  2018 2537000

A contingency table can now be produced from this data which is in long format by transforming the data to wide format and re-purposing some values as row names. To reformat the data to wide format we can use the pivot_wider() function of the tidyr package.

For this function there are several important arguments:
1) names_from - this is the variable where the names of new columns will come from 2) values_from - this is the variable where the values for the new columns will come from 3) names_prefix - if we want to add a prefix to the new columns we can do so using this argument

So in our case we want to spread out the year data into two columns thus the names will come from the Year variable and the values will come from the Number variable. We want to add the word Year to the new columns. We also want the remove the subgroup and data_type variables and only keep the Demographic, Year, and Number variables. To do so we can use the select() function. Fin

# A tibble: 2 x 3
  Demographic Year2004 Year2018
  <chr>          <dbl>    <dbl>
1 Male          637000   946000
2 Female       1588000  2537000

Now we can use the column_to_rownames() function of the tibble package to make the Demographic variable levels the row names. Otherwise we would need to select the numeric columns to perform the stats test.

       Year2004 Year2018
Male     637000   946000
Female  1588000  2537000

Note that a contingency table would also have totals for all groups as well, but this is not necessary for the stats::chisq.test() function.

The chi-squared test for independence can be conducted using the stats::chisq.test() function.


    Pearson's Chi-squared test with Yates' continuity correction

data:  chi_square_11.2a
X-squared = 1461.2, df = 1, p-value < 2.2e-16

We see that the p-value is very small, which suggests that sex does influence the count of major depressive episodes across time.

How about for severe major depressive episodes?

       Year2009 Year2018
Male     391000   628000
Female  1013000  1795000

    Pearson's Chi-squared test with Yates' continuity correction

data:  chi_square_11.3a
X-squared = 1696, df = 1, p-value < 2.2e-16

There also appears to be an influence of sex on the rate of severe major depressive episodes across the years.

How about treatment?

       Year2009 Year2018
Male     168000   351000
Female   505000  1081000

    Pearson's Chi-squared test with Yates' continuity correction

data:  chi_square_11.4a
X-squared = 50.256, df = 1, p-value = 1.349e-12

There also appears to be an influence of sex on the rate at which youths received treatment across the two years.

avocado.. I think it makes sense to move the visualization section first… and use it to motivate that analysis from the year 2010 and of male and females…

Data Visualization


OK, so now we are going to make some visualizations to further explore our questions of interest.

We are going to use the ggplot2 package to create our plots.

Click here for an introduction about this package if you are new to using ggplot2

The ggplot2 package is generally intuitive for beginners because it is based on a grammar of graphics or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.

The critical part to making graphics using ggplot2 is the data needs to be in a tidy format. Given that we have just spent time putting our data in tidy format, we are primed to take advantage of all that ggplot2 has to offer!

We will show how it is easy to pipe tidy data (output) as input to other functions that create plots. This all works because we are working within the tidyverse.

What is the ggplot() function? As explained by Hadley Wickham:

The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.

ggplot2 Terminology * ggplot - the main function where you specify the dataset and variables to plot (this is where we define the x and y variable names) * geoms - geometric objects * e.g. geom_point(), geom_bar(), geom_line(), geom_histogram() * aes - aesthetics * shape, transparency, color, fill, line types * scales - define how your data will be plotted * continuous, discrete, log, etc

The function aes() is an aesthetic mapping function inside the ggplot() object. We use this function to specify plot attributes (e.g. x and y variable names) that will not change as we add more layers.

Anything that goes in the ggplot() object becomes a global setting. From there, we use the geom objects to add more layers to the base ggplot() object. These will define what we are interested in illustrating using the data.

Major Depressive Episodes in past year

We will start by taking a look at the rate of major depressive episodes among youths over time of the various demographic groups.

OK, we will start out by using the ggplot() function to specify what data we would like to plot on each axis. We will also indicate that would like to use the Demographic variable to group our data and to color our data. This is our first layer of the plot, thus for subsequent layers we need to use a plus sign +.

Next we will use the geom_line() function of the ggplot2 package to specify that we would like to create a line plot.

Then we will add labels for the title and subtitle using the labs() function of the ggplot2 package.

Finally we will move our legend to the bottom of the plot using the theme() function which helps us control various details about our plot.

This plot very difficult to read because there are so very many groups. Now let’s look at just the total across time. We can do so by first filtering our data for total values.

It would also nice to include every year in the x-axis. We can do so by using the scale_x_continuous() function which gives us greater control about how the x-axis is displayed.

Finally, we will drop the legend since we will only have one group using legend.position = "none" and we can change the angle of the x-axis text using axis.text.x = element_text(angle = 90) within the theme() function.

We will also make the line thicker using the size = argument for the geom_line() function.

The theme_linedraw() function changes the aesthetics of the plot. See here for a list of options.

We can see that there is a steep increase after around 2011:

Let’s add a different background color for the years since 2011. We can do so by adding a geom_rect() layer before we plot the line. We just need to specify the location of the rectangle on our plot.

Now let’s look at group differences.

To make sure our plot is not too overwhelming, let’s limit to only age and sex subgroups. Thus we will filter out the data about totals and different racial/ethnic groups for now. We will also use the facet_wrap() function to make subplots based on the demographic categories, which we put in a variable called subgroups earlier when we wrangled the data.

Nice! Now it is much easier to tell how each group has changed over time.

We can also add labels directly to the lines using the directlabels package. There are several methods to do so. See here for more information about the options for adding labels with this package. We will use the "far.from.others.borders" method so that our labels don’t overlap one another. We will also use dl.trans() of the directlabels package to move the labels slightly upward (y = y +0.35) and to the left (x = x -0.1). We will use the dl.move() function of the directlabels package to move one of the labels to a particular location.

This looks very clear now!

We can see that the majority of individuals that reported experiencing a major depressive episode in the past year were in an older age bracket. We can also see that the trend has been increasing for all three age brackets since roughly 2011.

We can also see an increase for both sexes since about 2011, but there is a steeper increase for females. Furthermore Females have a much higher rate than males for all years.

Let’s make the same plot with a different shaded background for the years of the increase like we did for the total plot.

So what might be accounting for this?

This cross-cultural review article published in 2012 suggests that aspects related to life-style due to modernity may be causing increased depression rates:

Modern populations are increasingly overfed, malnourished, sedentary, sunlight-deficient, sleep-deprived, and socially-isolated. These changes in lifestyle each contribute to poor physical health and affect the incidence and treatment of depression.

And although this may be true globally, the US has been arguably experiencing these modern lifestyle changes for years prior to this steep increase in 2011.

So what might have happened in the US around this time?

There is large amount of literature about how the use of smart phone and social media may have lead to increased depression rates. See this article which links to many scientific articles on the subject.

This has been a controversial topic due to conflicting findings, likely due to focus on different sites and aspects of social media across different studies.

The article points out that the true relationship between social media use and depression may be nuanced. Some individuals who have high face-to-face levels of interaction or lack of the opportunity to interact with others face-to-face (due to various barriers like geography), may actually experience lower risk for depression with more social media use.

Furthermore, different social media platforms may vary for their influence on depression.

Instagram was released in 2010 (which is around when our plot starts to show the upward increase in major depressive episodes, particularly in females) and according to the article:

Image-driven Instagram shows up in surveys as the platform that most leads young people to report feeling anxiety, depression and worries about body image.

Furthermore, it may be secondary effects of social media use, like less physical activity or less sleep that may increase the risk for depression.

Now let’s take a look at how the rate of major depressive episodes has changed across time for different racial/ethnic groups.

Since we have so many groups, we probably don’t want to directly label the lines this time. Instead, we will rely on the legend that will be automatically created.

We will use the the fct_reorder function of the forcats package to order the racial/ethnic groups in the legend based on the last value (using tail()) of the Percent variable.

We will also manually color our lines based on a color palette called viridis, which is more discernible for individuals with color-blindness. To do so we will use the scale_color_viridis_d() function of the ggplot2 package, which is intended for coloring discrete values.

Unfortunately, there is only one value for NHOPI group, thus since this is a line plot, we do not have enough points(2 at minimum) to create a line, so lets remove this group from the plot to remove the group from the legend.

# A tibble: 15 x 5
   Demographic subgroup       data_type                 Year Percent
   <chr>       <chr>          <chr>                    <dbl>   <dbl>
 1 NHOPI       Race/Ethnicity Major_Depressive_Episode  2004    NA  
 2 NHOPI       Race/Ethnicity Major_Depressive_Episode  2005    NA  
 3 NHOPI       Race/Ethnicity Major_Depressive_Episode  2006    NA  
 4 NHOPI       Race/Ethnicity Major_Depressive_Episode  2007    NA  
 5 NHOPI       Race/Ethnicity Major_Depressive_Episode  2008    NA  
 6 NHOPI       Race/Ethnicity Major_Depressive_Episode  2009    NA  
 7 NHOPI       Race/Ethnicity Major_Depressive_Episode  2010     1.8
 8 NHOPI       Race/Ethnicity Major_Depressive_Episode  2011    NA  
 9 NHOPI       Race/Ethnicity Major_Depressive_Episode  2012    NA  
10 NHOPI       Race/Ethnicity Major_Depressive_Episode  2013    NA  
11 NHOPI       Race/Ethnicity Major_Depressive_Episode  2014    NA  
12 NHOPI       Race/Ethnicity Major_Depressive_Episode  2015    NA  
13 NHOPI       Race/Ethnicity Major_Depressive_Episode  2016    NA  
14 NHOPI       Race/Ethnicity Major_Depressive_Episode  2017    NA  
15 NHOPI       Race/Ethnicity Major_Depressive_Episode  2018    NA  

Census Bureau definitions:
1) AIAN stands for American Indian and Alaska Native
2) NHOPI stands for Native Hawaiian or Other Pacific Islander

We can see that the group of individuals who reported as being two or more races, had the highest percentages of having a major depressive episode in the past year. Those group who reported as Black or African American had the lowest percentages. However, we can see that most of the racial/ethnic groups are fairly similar and we see an increasing for most groups since around 2011. Keep in mind the limitations listed in the Limitations section as you view these findings. It is possible that this group may be less likely to report experiencing symptoms of depression.

Major Depressive Episode with Severe Impairment

Now let’s take a look at how the rate of youths reporting having a major depressive episode with severe impairment has changed over time. See the What are the data? section about how severe impairment was defined.

We can see that the majority of individuals that reported experiencing a major depressive episode with severe impairment were in an older age bracket, however there appears to be a more dramatic change in the middle age group from 2011-2012. We can see a very steep increase in the data for the females after 2011, again this is much more steep than the increase seen for males over time.

Now let’s look at racial/ethnic groups.

Census Bureau definitions:
1) AIAN stands for American Indian and Alaska Native
2) NHOPI stands for Native Hawaiian or Other Pacific Islander

We see similar trends as we saw for the previous racial/ethnic group plot. The rate is highest for those who are two are more races and lowest for those who are Black or African American. The data for the AIAN group is sparse, so it is unclear if their levels would be lowest on the last year.

Receipt of treatment for depression for youths who had a major depressive episode

Now we will take a look at those who received treatment. First let’s look overall.

Overall roughly 40 percent of youths who self-reported experiencing a major depressive episode, also received treatment for depression.

This rate is increasing overtime like the trend of those who had a major depressive episode, yet the data is much more variable from one year to the next.

There seems to be an upward trend, but it isn’t nearly as much as the trend we saw for the increase of major depressive episodes. In general, the data seems to vary much more as well.

It looks as though youths who report as being white received the most care from mental health services.

Mental Health Services

We will also take a look at where youths are receiving treatment by using values from table11.1b which has the percentage values for counts presented in table11.1a.

We can use the str_detect() function of the stringr package to filter for the values of the short_label variable that has the word total in it.

We can see that youths appear to be receiving care in nonspecialty facilities at a slightly higher rate than that of specialty facilities. However, the rates appear to be very similar and the relative differences appear to be consistent across time.

Let’s take a look at subcategories of mental health services. So now we will filter for values within the short_label variable that do not contain the word “total” by using a ! in front of the str_dectect statement.

OK, so now we know how the rates of different subgroups compare for having a major depressive episode in the past year, having a major depressive episode with severe impairment, and receiving treatment. We also know where youths are typically getting treatment. But how do the rates of having a major depressive episode in the past year, having a major depressive episode with severe impairment, and receiving treatment compare within each group?

Overall outcomes by group

We can see that a large portion of individuals experiencing a major depressive episode have an episode with severe impairment for each group. Females have a higher rate of both types of episode and of treatment. Although females have more than double the rate of reported episodes, they receive a relatively similar rate of treatment as males for depression. This suggests that females are either more likely than males to self-report depression symptoms in surveys, or females may not be receiving as much care despite the larger need.

All age groups show a similar ratio of severe major depressive episodes for those that experienced an episode.

All racial and ethnic groups also show a similar rate of severe episodes relative to general episode rate. The rate of receiving treatment is fairly similar relative to the percentage of youths that reported having symptoms for each group.

Summary


Summary Plot

Let’s make a plot that summarizes our major findings.

We will use the ggdraw() function of this package. This allows you to add labels and other plot aspects on top of existing plots. Thus if we want to add a title element to our overall plot that we will add to a combined plot of our existing plots we can use ggdraw() to start and then the draw_label() function to add text.

We can also make a subtitle in the same way.

Now we will modify some of our existing plots using the theme() function as we did before to remove the x-axis title, to change the color of the axis text and the title size and color, as well as change the titles of the plots.

For this last plot we also want to get the legend and save it as a separate object so that we can add it to our plot grid in a way that doesn’t shrink the size of our plot to accommodate the legend. We can use the get_legend() function of the cowplot package to do this. However, beforehand, we also want to change the way the legend is displayed. We can use the guides() function of the ggplot2 package to modify the legend to specify that we want the legend to be displayed in 2 rows like so guides(guide_legend(nrow = 2)).

Now we will remove the legend for this plot:

OK, now we are ready to start putting our plots together using the plot_grid() function of the cowplot package.

It is helpful to first make rows by combining the plots that we want to be displayed next to each other and then combining these with the title and subtitle, called forward.

We can use the rel_widths argument to modify how wide each plot is displayed.

Now we can combine everything together using plot_grid() yet again. Now that we have rows, we can combine everything as a single column and easily modify the relative heights using the rel_heights() function so that our title, subtitle and legend are all relative short relative to the plots. We will make the first row half the height of the second row.

Finally, we will use the png() function of the grDevices package which is automatically loaded in RStudio sessions to save the plot. We will use the here() function of the here package, to specify that we want to save it in the img directory and call it mainplot.png. We can also use this function to specify the resolution with res and in doing so, we need to save the image with size specifications to make it larger.

The dev.off() function is necessary to close the graphics device. This is good practice to allow you to create another plot again later.

avocado: I want to add something about self-reporting bias to the objectives

Synopsis

In this case study we evaluated self-reported measures of depression symptoms among youths age 12-17 in the United States, as well as the rate of youths receiving treatment for depression. We learned how to scrape data directly from the web using the rvest package and we learned how to perform and interpret a chi-square test using the chisq.test() function of the stats package.

By analyzing and plotting our data, it is clear that depression rates appear to be increasing, particularly since 2011. However, it is possible that respondents had similar rates in previous years, but now feel less stigma about responding about depression symptoms when filling out the survey. The survey has always been anonymous, but reporting bias can sometimes cause individuals to exaggerate or minimize their symptoms because of what they think researchers want their response to be or out of shame or embarrassment, among other reasons. However, the data suggests that youths may be experiencing more symptoms of depression and that the rate of increase is quite high. Now nearly a quarter of all individuals that were female and of age 12-17 reported experiencing symptoms of depression. This warrants further investigation to see if this is a product of more reporting or if indeed American females are truly more depressed. Furthermore,if they are indeed more depressed, investigation about why young females are more depressed is also of critical importance. One important limitation is that the data does not include subgroup intersections, such as rates of major depressive episodes among females of various ethnic backgrounds. Considering the very steep increase in females, this warrants further investigation about which females are particularly vulnerable and why.

Homework


Ask students to scrape tables 11.5A and 11.5B from the website which contain data about the receipt of treatment among youths who reported having a severe episode. Ask students to create plots and perform chi-square tests to evaluate how groups compare over time.

Additional Information


Acknowledgements

We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

---
title: "Open Case Studies : Mental Health of American Youth"
css: style.css
output:
  html_document:
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/logo.jpg");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>



```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magick)# to create gif
```

#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "mainplot.png"))
```

####

## {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

## {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"}  United States License.

## {.reference_block}

To cite this case study please use:

Wright, Carrie, and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-youth-mental-health-case-study. Mental Health of American Youth (Version v1.0.0).


## **Motivation**
*** 

Rates of depression appear to have been increasing among American youths since around 2010 according to a recent [report](https://content.apa.org/record/2019-12578-001){target="_blank"}. A [recent study](https://pubmed.ncbi.nlm.nih.gov/24285382/){target="_blank"}) also shows that youths appear to be seeking more care from mental health services.



This case study will explore how rates of major depressive episodes have changed since the early 2000s. As well as how rates of treatment for depression of youths have changed over time.


```{r,echo = FALSE, out.width="40%"}
knitr::include_graphics(here::here("img", "k-mitch-hodge-IqSaG9zv2e0-unsplash.jpg"))
```
<span>Photo by <a href="https://unsplash.com/@kmitchhodge?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">K. Mitch Hodge</a> on <a href="https://unsplash.com/s/photos/depression?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>  


The major symptoms of a major depressive episode include:   

**S**leep disorder (increased or decreased)  
**I**nterest deficit (anhedonia)  
**G**uilt (worthlessness, hopelessness, regret)  
**E**nergy deficit  
**C**oncentration deficit  
**A**ppetite disorder (increased or decreased)  
**P**sychomotor retardation or agitation  
**S**uicidality  

#### [[source]](https://www.icsi.org/guideline/depression/diagnose-and-characterize-major-depression-persistent-depressive-disorder-with-clinical-interview/){target="_blank"}  


```{r, echo = FALSE, out.width="80%"}
knitr::include_graphics(here::here("img", "depression-symptoms-and-treatment-768x768.jpg"))
```

#### [[source]](https://newmilfordcounselingcenter.com/depression-2/){target="_blank"}  

<details> <summary> Click here to see the diagnostic requirements for a major depressive episode according to the [DSM 5](https://en.wikipedia.org/wiki/DSM-5){target="_blank"}</summary>    

Five or more of the following symptoms have been present and documented during the same two-week period and represent a change from previous functioning; at least one of the symptoms is either (1) depressed mood or (2) loss of interest or pleasure.  

Note: Do not include symptoms that are clearly attributable to another medical condition.  

1) Depressed mood most of the day, nearly every day, as indicated by either subjective report (e.g., feels sad, empty, hopeless) or observation made by others (e.g., appears tearful)  

2) Markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day (as indicated by either subjective account or observation)  

3) Significant weight loss when not dieting or weight gain (e.g., a change of more than 5% of body weight in a month), or decrease or increase in appetite nearly every day  

4) Insomnia or hypersomnia nearly every day  

5) Psychomotor agitation or retardation nearly every day (observable by others, not merely subjective feelings of restlessness or being slowed down)  

6) Fatigue or loss of energy nearly every day  

7) Feelings of worthlessness or excessive or inappropriate guilt (which may be delusional) nearly every day (not merely self-reproach or guilt about being sick)  

8) Diminished ability to think or concentrate, or indecisiveness, nearly every day (either by subjective account or as observed by others)  

9) Recurrent thoughts of death (not just fear of dying), recurrent suicidal ideation without a specific plan, or a suicide attempt or a specific plan for committing suicide  

B. The symptoms do not meet criteria for a mixed episode.  

C. The episode is not attributable to the physiological effects of a substance or to another medical condition.  

Note: Criteria A-C represent a major depressive episode.  

Note: Responses to a significant loss (e.g., bereavement, financial ruin, losses from a natural disaster, a serious medical illness or disability) may include feelings of intense sadness, rumination about the loss, insomnia, poor appetite and weight loss noted in Criterion A, which may resemble a depressive episode. Although such symptoms may be understandable or considered appropriate to the loss, the presence of a major depressive episode in addition to the normal response to a significant loss should also be carefully considered. This decision inevitably requires the exercise of clinical judgment based on the individual’s history of and the cultural norms for the expression of distress in the context of loss.  

D. The occurrence of the major depressive episode is not better explained by schizoaffective disorder, schizophrenia, schizophreniform disorder, delusional disorder, or other specified and unspecified schizophrenia spectrum and other psychotic disorders.  

E. There has never been a manic episode or a hypomanic episode.  

Note: This exclusion does not apply if all of the manic-like or hypomanic-like episodes are substance-induced or are attributable to the physiological effects of another medical condition.  

</details>  
  
    
  
This case study is motivated by the following two articles:  
  

#### {.reference_block}

Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. *J Abnorm Psychol*.128,3 (2019):185-199. doi:10.1037/abn0000410


Olfson, M., Blanco, C., Wang, S., Laje, G. & Correll, C. U. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. *JAMA Psychiatry*. 71, 81 (2014):81-90. doi: 10.1001/jamapsychiatry.2013.3074.

####

The main findings of the first [article](https://content.apa.org/record/2019-12578-001){target="_blank"} are:

> Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25. 

> Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over. 

> Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.

While the main findings of the second [article](https://pubmed.ncbi.nlm.nih.gov/24285382/){target="_blank"} are:

>Compared with adult mental health care, the mental health
care of young people has increased more rapidly.

>Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses
per 100 population increased significantly faster for youths (from 7.78 to 15.30 visits) than for
adults (from 23.23 to 28.48 visits) (interaction: P < .001). 

>Psychiatrist visits also increased
significantly faster for youths (from 2.86 to 5.71 visits).


Again while depression appear to be on the rise for youths, youths also appear to be seeking more mental health care.

In this case study we will be using data from the [National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm){target="_blank"} related to treatment and major depressive episode rate to explore how this have changed over time and how different groups compare. This data was also used in the first referenced article.  


## **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main questions: </u></b>

1) How have depression rates in American youth changed since 2004, according to the NSDUH data? How have rates differed between different youth subgroups (age, gender, ethnicity)?
2) Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?


####

## **Learning Objectives** 
*** 

<u>**Statistical Learning Objectives:**</u> 

1. Define and create a contingency table.
2. Implementation of a chi-squared test for independence.
3. Interpretation of a chi-squared test for independence.

<u>**Data science Learning Objectives:**</u>

1. Scrape data directly from a website (`rvest`).
2. Subset and filter data (`dplyr`).
3. Write functions to wrangle data repetitively.
4. Work with character strings (`stringr`).
5. Reshape data into different formats (`tidyr`).
6. Data visualizations (`ggplot2`) with labels (`directlabels`) and facets for different groups.


In this case study, we will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"}, such as [`rvest`](https://github.com/tidyverse/rvest){target="_blank"}. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.

```{r, out.width = "20%", echo = FALSE, fig.align ="center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```

*** 

We will begin by loading the packages that we will need:

```{r}
library(here)
library(rvest)
library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(tibble)
library(ggplot2)
library(directlabels)
library(forcats)
library(cowplot)
```



 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[rvest](https://github.com/tidyverse/rvest){target="_blank"}      | to scrape web pages  
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to scrape web pages  
[magrittr](https://magrittr.tidyverse.org/){target="_blank"}      | to scrape web pages  
[stringr](https://stringr.tidyverse.org/){target="_blank"}      | to manipulate strings  
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to change the shape or format of tibbles to wide and long  
[tibble](https://tibble.tidyverse.org/){target="_blank"}      | to create tibbles and convert values of a column to row names  
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}      | to create plots  
[directlabels](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"}      | to add labels directly to lines in plots  
[forcats](https://forcats.tidyverse.org/){target="_blank"}      | to reorder factor for plot  
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"}      | to combine plots together  


The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.

## **Context**
*** 

According to the CDC the rate of suicide has also increased for most age groups in the United States over the past decade and a half.

```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig1.png")
```

#### [[source]](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}


While suicide does appear to be increasing among youths it also appears to be increasing among middle aged adults for both females and males. 

```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig2.png")
```

#### [[source]](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}




```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig3.png")
```

#### [[source]](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}


According to the [CDC](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}:

> Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States. In 2016, suicide became the **second leading cause of death** among those aged **10–34** and the fourth leading cause among those aged 35–54.




```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","mortality.png"))
```

#### [[source]](https://www.cdc.gov/nchs/data/databriefs/db293.pdf){target="_blank"}   

  

**So although suicide is on the rise for most age groups, suicide is one of the top *two* contributors to death for youths.**   

Thus this warrants further examination of the mental health of American youths.  


```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","mortality_age.png"))
```

#### [[source]](https://www.cdc.gov/nchs/data/nvsr/nvsr68/nvsr68_06-508.pdf){target="_blank"}



Historically, suicide rates were much higher before 1950, however, we are seeing an increase in the last 20 years.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","suicide.png"))
```

#### [[source]](https://time.com/5609124/us-suicide-rate-increase/){target="_blank"}



Besides the US, [other countries](https://academic.oup.com/ije/article/48/5/1650/5366210){target="_blank"} are also experiencing increased rates of depression in youths. See [this report](https://apps.who.int/iris/bitstream/handle/10665/254610/WHO-MSD-MER-2017.2-eng.pdf;jsessionid=E44360055DD83EAC472AA40C2853DBFA?sequence=1){target="_blank"} from the  World Health Organization about rates of depression in other countries.

See [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330161/){target="_blank"} for an interesting discussion about what may be causing increased depression rates. 


According to the [National Institute of Mental Health (NIMH)](https://www.nimh.nih.gov/health/publications/teen-depression/index.shtml){target="_blank"}:


#### {.emphasis_block}

If you are in crisis and need help, call this toll-free number for the **National Suicide Prevention Lifeline (NSPL)**, available 24 hours a day, every day: **1-800-273-TALK (8255)**. The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at [www.suicidepreventionlifeline.org](www.suicidepreventionlifeline.org){target="_blank"}.

The **Crisis Text Line** is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to **741741** and a trained crisis counselor will respond to you with support and information over text message. Visit [www.crisistextline.org](www.crisistextline.org){target="_blank"}.

####

Also see [here](https://www.mhanational.org/depression-teens-0){target="_blank"} for more information about how to recognize and help youths experiencing symptoms of depression.

## **Limitations**
*** 



avocado -Michael:Perhaps "underestimates in the p-values..." is not the correct way to phrase this. I would look for a better way to word this.-Carrie:This is my attempt after Michael's... open to changes!


There are some important considerations regarding this data analysis to keep in mind: 

1) The data that we will use come from a survey and are therefore values from a sample that estimate that of the true population. In our statistical analysis we use these sample values as if they are population estimates (because this is all we have access to). Thus our results are not necessarily indicative of true differences.

2) Furthermore, the sampling mechanism utilized can introduce [selection bias](https://en.wikipedia.org/wiki/Selection_bias?oldformat=true){target="_blank"} in cases where the the [sampling methods do not produce a representative sample](https://en.wikipedia.org/wiki/Sampling_(statistics)?oldformat=true){target="_blank"}. 

3) Data is collected from human participants; this presents the *potential* for information bias, as there is the *potential* that participants in the [sampling frame](https://en.wikipedia.org/wiki/Sampling_frame?oldformat=true){target="_blank"} may for a variety of reasons report inaccurate information. 

## **What are the data?**
*** 

We will be using data from the [National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm){target="_blank"} which is directed by the [Substance Abuse and Mental Health Services Administration (SAMHSA)](https://www.samhsa.gov/){target="_blank"}, an agency in the [U.S. Department of Health and Human Services (DHHS)](https://www.hhs.gov/){target="_blank"}. 

This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia. Approximately 70,000 people (age 12 and up) are interviewed each year about health related issues. Only civilian, non-institutionalized individuals are included. Households are randomly selected and than a professional interviewer visits the addresses and asks one or two of the residents to interview. The interviewer brings a laptop with them that the participants use to fill out the survey which typically takes an hour to complete. If a participant chooses to participate they receive $30 in cash. All collected information is confidential and is used for disease surveillance and to guide public policy particularly focused on drug and alcohol use as well as mental health. See [here](https://nsduhweb.rti.org/respweb/about_nsduh.html){target="_blank"} for more details about the survey.

This data is made available publicly online on the [Substance Abuse & Mental Health Data Archive](https://datafiles.samhsa.gov/){target="_blank"}. 

```{r, out.width = "100%", echo = FALSE, fig.align ="center"}
include_graphics(here::here("img", "nsudh_screenshot_webpage.png"))
```

At the [website](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} for the survey data, you can see that the results are displayed in many tables. Importantly, there is no obvious way to download the data directly from this particular website.

```{r, out.width = "100%", echo = FALSE, fig.align ="center"}
include_graphics(here::here("img", "website_overview.png"))
```

If one clicks on the TOC button on the far right upper corner they will be directed to another [website](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetailedTabsTOC2018.htm#toc){target="_blank"}, where a large [pdf document](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetailedTabs2018.pdf){target="_blank"} containing of all of the results can be downloaded.

We are interested in investigating how depression rates have changed and how youths are interacting with mental health services. Thus the following tables are of interest to us are:

Table   | Details                                                                         
---|-------------
Table 11.1A       | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018   
Table 11.1B       | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018  
Table 11.2A       |  Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.2B       | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018
Table 11.3A       | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018
Table 11.3B       | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018
Table 11.4A       | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.4B       | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018


According to the [NSDUH 2018 report](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHNationalFindingsReport2018/NSDUHNationalFindingsReport2018.pdf){target="_blank"}


> Respondents were defined as having had an MDE in
the past 12 months if they had at least one period of 2 weeks
or longer in the past year when they experienced a depressed
mood or loss of interest or pleasure in daily activities,
accompanied by problems with sleeping, eating, energy,
concentration, or self-worth. The MDE questions are based
on diagnostic criteria from DSM-5. Some of the wordings
of the depression questions for adolescents aged 12 to 17
and adults aged 18 or older differed slightly to make the
questions more developmentally appropriate for adolescents.

> Adolescents were defined as having an MDE with
severe impairment if their depression caused severe problems
with their ability to do chores at home, do well at work or
school, get along with their family, or have a social life.




## **Data Import**
*** 

Data is often made available online. Usually, the data we are interested in is made available for download on the page as a delimited text file or an excel file. However, sometimes data is not made available in this manner, such as the [NSDUH survey data](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"}.

How do we proceed in this scenario?

We can manually copy each cell of data, however, this process is often inefficient, subject to error, and not reproducible. Say we wanted to run an analysis next year on the next years data and it happens to be formatted in the same way. 

We can also use `R` for web scraping. 

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping?oldformat=true){target="_blank"} is the process of extracting data from a website.


### Basic steps of web scraping

There are two main steps to web scraping:  

1. Identify location of data on the webpage that will be scraped  

2. Save the webpage element to an object  

We accomplish STEP 1 with our web browser.

We accomplish STEP 2 in the `R` programming environment. 

#### {.resource_block} 

<u>Additional resources for web scraping</u>:

- [Vignette](https://rstudio-pubs-static.s3.amazonaws.com/266430_f3fd4660b2744751ab144aa130768a06.html){target="_blank"}
- [Blog](http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html){target="_blank"}
- [Blog](http://research.libd.org/rstatsclub/post/introduction-to-scraping-and-wranging-tables-from-research-articles/#.Xw878ZNKhQJ){target="_blank"}
- [Selectorgadget Tool](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html){target="_blank"}

####

In this case study we will scrape data from the tables on the [NSDUH survey](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} website. This data is available in a large PDF with all the results form the year. However it is not easy to find this PDF and it would be difficult and time consuming to find our tables of interest and to extract the data from the pdf with `pdftools`. Again, if we instead decided to copy paste the data from the website to another file that we would also need to import, this would not be as efficient or reproducible and might result in errors. 


Alternatively, we will use the `rvest` package to [scrape](https://en.wikipedia.org/wiki/Web_scraping?oldformat=true){target="_blank"} the data directly from the tables on the website. Assuming the data next year would be displayed in a similar manner, this could allow us simply modify our code based on the url for the data next year to run the same analysis on the data easily. 

The `rvest`  package can be thought of as the `pdftools` package for web scraping. Upon pulling the data, additional wrangling will likely be required; but like the `pdftools` package, `rvest` streamlines the extraction process.  

### Steps for scraping tables

The two web scraping steps for these tables can be broken down even further: 

1) Identify location of data that will be scraped

+ right-click to inspect element (webpage)
+ hover pointer over components of element (webpage) until the data has been found
+ copy XPath of data sought

2) Save webpage element to an object in R

+ import html code for the webpage
+ extract pieces of HTML documents (webpage) using XPath
+ parse the extracted data into a data frame

Below is a animated overview of the process.

```{r, echo = FALSE}
step1 <- image_read(here::here("img", 
"webpage_screenshot.png"))
step2 <- image_read(here::here("img", "table_screenshot_inspect.png"))
step3 <- image_read(here::here("img", "table_screenshot_inspect_table.png"))
step4 <- image_read(here::here("img", "table_screenshot_inspect_table_xpath.png"))
step5 <- image_read(here::here("img", "table_screenshot_xpath_copy_r.png"))
step5_zoom <- image_read(here::here("img", "table_screenshot_xpath_copy_r_zoom.png"))
```

```{r, eval=FALSE, echo=FALSE}
image_info(step5_zoom)

step5_zoom <- image_border(step5_zoom, "white", "284x334")

img <- c(step1,
         step2,
         step2,
         step3,
         step3,
         step4,
         step4,
         step5,
         step5,
         step5_zoom,
         step5_zoom,
         step5_zoom,
         step1)

educational_gif <- image_resize(img, '1440x900!') %>%
  image_background('white') %>%
  image_morph(frames = 10) %>%
  image_animate(delay = 20,
                optimize = TRUE)

image_write(educational_gif,
      here::here("img", "educational.gif"))
```

```{r, echo=FALSE}
knitr::include_graphics(here::here("img","educational.gif"))
```

***

Now let's go through each step together:

### 1) Identify location of data that will be scraped

First, let's go to the [web page](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} with all the tables we are interested in scraping

```{r, step1, echo=FALSE}
step1
```

Once on the webpage, there aren't any visible options to download the data. 

Right-click and select "Inspect" 

```{r, step2, echo=FALSE}
step2
```

A window opens. 

This window allows us to glance at the internal mechanics of the webpage. To scrape the data from the webpage, we need to first learn a little bit about the components that make it the web page it is. 

Hovering our mouse over the elements of the webpage highlights the respective section of the webpage it represents. By hovering over several elements—and clicking on the elements on the right side of the screen—we can identify the element that contains the data we are looking for. Another option for identifying XPaths is to use the [selectorgadget tool](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html){target="_blank"}.  

```{r,step3, echo=FALSE}
step3 
```

Right click on the element and copy the XPath. We will need this XPath for the next step.

```{r, step4, echo=FALSE}
step4
```

Now we can return to the `R` programming environment.

```{r, step5, echo=FALSE}
step5
```

***

### 2) Save webpage element to an object in R 

For the first table we want to scrape, the XPath is `/html/body/div[4]/div[1]/table`. We use this XPath with functions from the `rvest` package to scrape the data from this table.


```{r,step5_zoom, echo=FALSE}
step5_zoom
```


Let's explore this step in greater detail:

We need to:

+ import html code for the webpage
+ extract pieces (table) out of HTML documents (webpage) using XPath
+ parse the html table into a data frame

To do this:

+ We import the html code using the `read_html()` function of the `rvest` package
+ We extract specific components of the webpage using the `html_nodes()` function of the `rvest` package
+ We convert this html table into a dataframe using the `html_table()`function of the `rvest` package

**The `rvest` package provides wrappers for the `xml2` and `httr` packages, thus we can just install and load the `rvest` package and it will install and load dependency packages like `xml2` and `httr` and allow us to use functions from  both of these packages.**

In fact, when we load `rvest` the first time we see:

```{r, out.width= "60%"}
knitr::include_graphics(here::here("img", "rvest.png"))
```

In this case, we are scraping table 11.1a from the website. First we assign the url to a character string to use within the `read_html()` function of the `xml2` package. 

```{r}
NSDUH_url <- "https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm"
```

One could also directly use the url but this is less convenient for piping.  

<details> <summary>Click here if you are unfamiliar with piping in R, which uses this `%>%` operator</summary>  

By [piping](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} we mean using the `%>%` pipe operator which are usable when loading the tidyverse or several of the packages within the tidyverse like `dplyr` because they load the [`magrittr` package](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}. This allows us to perform multiple sequential steps on one data input.   

</details>  

  
The `read_html()` function then allows us to save the html document for the webpage inside R. 

```{r}
webpage <- NSDUH_url %>%
  xml2::read_html() 
webpage
```

Then we use the `html_nodes()` function of the `rvest` package to select just the table11.1a element of the webpage.

See this [tutorial](http://flukeout.github.io/#){target="_blank"} (and the [answers](https://gist.github.com/chrisman/fcb0a88459cd98239dbe6d2d200b02d1){target="_blank"} in case you get stuck) on CSS selectors to understand more about how this function works to use the `xpath` to select the elements of interest from the webpage.


```{r}
webpage_element <-webpage %>%
  rvest::html_nodes(xpath='/html/body/div[4]/div[1]/table')
webpage_element 

```

Finally, the `html_table()` function of the `rvest` package parses the html object into a data frame.

```{r}

table11.1a<-webpage_element%>%
  rvest::html_table()
print(table11.1a, max = 2)
glimpse(table11.1a)
```

We can see that the output is a list with one element, to extract the data from the list we will use brackets `[[]]` to select the first element of the list.
```{r}
table11.1a <- table11.1a[[1]]
```


Putting this all of this together we can do the entire process like this with our pipe operator `%>%`.

```{r}
table11.1a <- NSDUH_url %>%
  xml2::read_html() %>%
  rvest::html_nodes(xpath='/html/body/div[4]/div[1]/table') %>%
  rvest::html_table()
table11.1a <- table11.1a[[1]]
```


Now need to repeat the above process for the other tables we are interested in. 

### Writing a function to scrape multiple tables

We can create a function to accomplish this succinctly. 
Functions allow us to perform the same process on multiple data inputs. See [this other case study](https://opencasestudies.github.io/ocs-bloomberg-vaping-case-study/){target="_blank"} for more details about how to write a function.

In general, the process pf writing functions involves first specifying an input that is used within the function to create an output. In this case the data input is `XPATH` which will be replaced by an actual XPath and then used in the subsequent steps to scrape the data from each table that an XPath is supplied for.

We will all this function `scraper`.

```{r}
scraper <- function(XPATH){
  NSDUH_url <- "https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm"
  table <- NSDUH_url %>%
  read_html() %>%
  html_nodes(xpath=XPATH) %>%
  html_table()
  output <- table[[1]]
  output
}
```

Now we can apply the function we created to each of the XPaths for each of the tables on the website that we would like to use in our data analysis.

```{r}
table11.1b <- scraper(XPATH = "/html/body/div[4]/div[2]/table")
table11.2a <- scraper(XPATH = '/html/body/div[4]/div[3]/table')
table11.2b <- scraper(XPATH = '/html/body/div[4]/div[4]/table')
table11.3a <- scraper(XPATH = '/html/body/div[4]/div[5]/table')
table11.3b <- scraper(XPATH = '/html/body/div[4]/div[6]/table')
table11.4a <- scraper(XPATH = '/html/body/div[4]/div[7]/table')
table11.4b <- scraper(XPATH = '/html/body/div[4]/div[8]/table')
```


Great! We have successfully scraped the data.

Now we need to wrangle the data.


## **Data Exploration and Wrangling**
*** 

Now that we've imported the data, let's see if we can wrangle a table. Since the data appears to be formatted in a similar way in each of the tables, it is likely that whatever steps we take to wrangle this first table will also be necessary in the wrangling of subsequent tables. This is because well-maintained data sources often format different datasets similarly. We can take advantage of this similarity to speed up the wrangling process. 

### **Table11.1a**

First we want to remove the last row of our data frame, which happens to be the legend of our table. Recall from looking at the website that there is a legend for this table.

```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "table11.1a.png"))
```

We can take a look at the last row using the `tail` function of the `dplyr` package. We can specify that we only want to see the last row by using the `n = 1` argument.

```{r}
table11.1a %>%
  dplyr::as_tibble() %>%
  tail(n = 1)
```

We can see that the legend is repeated for every column. Let's take a look at the year 2004 column.
```{r}
table11.1a %>%
  dplyr::as_tibble() %>%  
  select(`2004`) %>%
  tail(n = 1)
```

Let's save this so that we can refer back to it later:
```{r}
legend <- table11.1a %>%
  as_tibble() %>%  
  select(`2004`) %>%
  tail(n = 1)
```

Another way to look at the last row is to use the `n()` function of the `dplyr` package. This function can be used inside other `dplyr` functions and it counts the total number of observations of a group. Within the [`slice()` function](https://dplyr.tidyverse.org/reference/slice.html){target="_blank"} of the `dplyr` package, it allows you to refer the full length of the object.

```{r}
table11.1a %>%
  dplyr::as_tibble() %>%
  slice(n()) 
```
We can use the `slice()` function of the `dplyr` package to remove this row, by using the `slice`function to select from the first row using `1:` to the second to last row using `n()-1`.

We are also going to use a special pipe operator from the [`magrittr` package](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} called the compound assignment pipe-operator or sometimes the double pipe operator. This allows us to use the table11.1a as our input and reassign it at the end after all the steps have been performed.

We will also first change the data to be a [tibble](https://tibble.tidyverse.org/#:~:text=A%20tibble%2C%20or%20tbl_df%20%2C%20is,modern%20reimagining%20of%20the%20data.&text=Tibbles%20are%20data.,a%20variable%20does%20not%20exist).), which is the tidyverse version of a data frame using the `as_tibble()` function of the `dplyr` package and the `tibble` package.

```{r}
table11.1a %<>%
  dplyr::as_tibble() %>%
  slice(1:(n()-1))
```

Now let's take a look at data:
```{r}
table11.1a
```

Great! We can see the the legend is no longer part of the data.

Now let's use the legend to recode the data. There are many different values for missing data, that we would like to replace with `NA` instead.
We can use the `pull()` function of the `dplyr` package to take a look at the legend data:

```{r}
pull(legend, `2004`)
```

Looks like we want to replace values that are: `*`, `--`, `da`, `nc`, and `nr`. We can use the `na_if()` function to recode these values to `NA`.

avocado... there isn't support for doing this in one command... but could at least do two commands



```{r}
table11.1a %<>%
  dplyr::na_if("nc") %>%
  dplyr::na_if("--") %>%
  dplyr::na_if("") %>%
  dplyr::na_if("*")

table11.1a 
```

Now let's rename the first column using the `rename()` function of the `dplyr` package. This requires listing the new name first like so: `new_name = old_name`.
```{r}
table11.1a %<>%
  dplyr::rename(MHS_setting = 
                  `Setting Where Mental Health ServiceWas Received`)
head(table11.1a)
```

Nice!

Now you may notice that the individual values have an `"a"` after the numeric value.

According to the legend this indicates if "the difference between this estimate and the 2018 estimate is significant at the .05 level."

While this is useful information, it makes it difficult to work with our numeric values, so we want to remove them.

Since lower case "a" values occur in the first column values, we want to makes sure we don't remove these.

So how can we do this? We can use the `stringr` package to modify character strings. and we can use the `dplyr` functions `mutate()`, `select()` and `across()` to specify want columns we want to change.

Currently all of our data is of class character as indicated by the `<chr>` under the column names. 

<details> <summary> Click here for an explanation of what a character string is </summary>

There are several classes of data in R programming. Character is one of these classes. 
A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter `"a"` or the number `"3"`. If data is of class character, than the numeric values will not be processed like a numeric value in a mathematical sense. If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place). 

</details>

  
The `stringr` package has functions that allow us to replace (the `str_replace()` function) or remove(the `str_remove()` function) characters. 

To use these we need to be able to specify what we want to remove and replace. 

Here is a part of a [cheatsheet](https://rstudio.com/resources/cheatsheets/){target="_blank"} about string manipulation from RStudio.
```{r}
knitr::include_graphics(here::here("img", "regex.png"))
```

We can see that we can refer to any digit (such as 1,2,3 etc.) as `[:digit:]`.
We can also see that we can refer to any punctuation mark as `[:punct:]`.
Finally, we see that spaces and tabs can be referred to as `[:blank:]`.


If we take a closer look at the first column of our table (using the `pull()` function of the `dplyr` package), we can see that besides the `"a"` values that we see adjacent to our numeric values in the body of the table, we also some large white spaces, some numeric values, instances of `\r\n`, as well as some commas and other punctuation marks. 

```{r}
pull(table11.1a, MHS_setting)
```


We can use the `str_remove_all()` function which is a variant of the `str_remove()` function of the `stringr` package (which allows us to remove all occurrences of specified characters in each row rather than just the first occurrence, which is what `str_remove()` does), to remove the digit values, the `\r\n` characters and the punctuation marks from the column called `MHS_setting`.

Using the `mutate()` function we specify that we want to change this particular column and replace it with a version of this column that has removed characters that match digits, `r\n` or punctuation marks.

We need to specify that the character strings that should be used can be found in the `MHS_setting` column by using the `string =` argument and the patterns to find and remove are specified using the `pattern =` argument.

To allow us to look for all three of these patterns at the same time, we can use the `|` symbol between each pattern.


```{r}
table11.1a %<>%
mutate(MHS_setting = 
         str_remove_all(string = MHS_setting, 
                       pattern = "[:digit:]|\r\n|[:punct:]|"))

table11.1a
```

We also want to replace the spaces with a single space. We can see that sometimes there appears to be more than one space. We can specify that we want any occurrence of 1 or more  to be replaced by using the `{1,}` notation.

See here for an explanation of this on the cheat sheet:

```{r}
knitr::include_graphics(here::here("img", "quantifiers.png"))
```

So now we will use the `str_replace_all()` function of the `stringr` package.
In this case we also need to specify a replacement with the `replacement = ` argument.



```{r} 
table11.1a%<>%
mutate(MHS_setting = 
         str_replace_all(string = MHS_setting,
                        pattern = "[:blank:]{1,}", 
                    replacement = " "))

table11.1a
```

Now to finally remove the "a" values and the commas from the body of the table we can use `str_remove_all()` function yet again. However this time to specify that we want all columns except the first column called `MHS_setting`, we can use the `across()` function of the `dplyr` package. This allows us to specify what columns we want to mutate by using the `.cols = ` argument. We can select all columns except the first column called `MHS_setting` by using a minus sign `-` in front of the column name.



```{r}
table11.1a%<>%
mutate(dplyr::across(.cols = -MHS_setting,
                stringr::str_remove_all, "a|,"))

table11.1a
```

Our table is looking much better!

We also want to change our values to be numeric as opposed to character so that we can use them in mathematical functions. We can use the base `as.numeric()` function. Again we will use the `across()` function to indicate what variables we wish to mutate.

```{r}
table11.1a %<>%
  mutate(across(.cols =-MHS_setting, as.numeric))

table11.1a
```

We would also like to add a `type` and `subtype` variable, that specifies the general categories of settings where services were received, as well as remove a couple of rows that are completely empty. These are the rows where the first column values are `Genearl_Medicine` and `Juvenile_Justice`, and `Child Welfare`. If we look at the website, we can see that these were leading lines with no data.

```{r}
knitr::include_graphics(here::here("img", "table11.1a.png"))
```

First we will add the `type` and `subtype` variables using the `mutate` function.


```{r}
table11.1a %<>%
  mutate(type = c(rep("Specialty", 9), rep("Nonspecialty", 11))) %>%
  mutate(subtype =c("Specialty_total", 
                    rep("Outpatient", 5), 
                    rep("Inpatient", 3), 
                    "Nonspecialty_total", 
                    rep("Education", 3), 
                    rep("General_medicine", 2),
                    rep("Juvenile_Justice", 2),
                    rep("Child_Welfare", 2), 
                    "combination"))

```

We also want to add a variable with shorter names for labels in plots and statistical output.

```{r}
table11.1a %<>%
mutate(short_label = c("Specialty total", 
                       "Outpatient total", 
                       "Therapist", 
                       "Clinic", 
                       "Day program",
                       "In-home Therapist", 
                       "Inpatient total", 
                       "Hospital", 
                       "Residential Center",
                       "Nonspecialty total", 
                       "School total", 
                       "School Therapist", 
                       "School Program", 
                       "General Medicine",
                       "Family Dr",
                       "Justice System",
                       "Justice System",
                       "Welfare", 
                       "Fostercare", 
                       "Specialty Combination"))
```


We can remove the empty rows using the `filter()` function of the `dplyr` package. We can specify that we don't want to keep these rows by using the `!=` not equal to operator. 


```{r}
table11.1a %<>%
  dplyr::filter(MHS_setting != "General_Medicine") %>%
  dplyr::filter(MHS_setting != "Juvenile_Justice") %>%
  dplyr::filter(MHS_setting != "Child_Welfare")

table11.1a 
```


Finally, we would like to change the shape of our table so that we have a new column that represents the year and a new column that represents the value for that year. To do so we will be making our table "longer", meaning that it will have fewer rows and more columns.  See [here](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"} for more information about different table formats, typically referred to as wide and long or sometimes narrow.

We will use the `pivot_longer()` function of the `tidyr` package to change the shape of our table. 

There are 3 main arguments in this function:   
1) cols - which specifies what columns to collapse  
2) names_to - which specifies the name of the new column that will be created that will contain the column names of the columns you are collapsing  
3) values_to - which specifies the name of the new column that will be created that will contain the values from the columns you are collapsing 

To specify that we want to collapse all columns except the `MHS_setting` column we can again use the minus sign. Finally, we will make the `Year` variable numeric as well.


```{r}
table11.1a %<>%
  tidyr::pivot_longer(cols = contains("20"), 
                  names_to = "Year",
                 values_to = "Number") %>%
  mutate(Year = as.numeric(Year))

table11.1a
```

We can see that our table is now much longer - as we have `r dim(table11.1a)[1]` rows!

#### {.question_block}
<b><u> Question Opportunity </u></b>

Why do we have `r dim(table11.1a)[1]` rows now?

####

Now we see that the `Year` and `Number` variables are of class double because of the `<dbl>` under the column name.

Let's take a look at what the rest of the tables contain:


Table   | Details                                                                         
---|-------------
Table 11.1A       | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018   
Table 11.1B       | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018  
Table 11.2A       |  Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.2B       | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018
Table 11.3A       | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018
Table 11.3B       | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018
Table 11.4A       | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018
Table 11.4B       | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018

OK, so the next table is very similar to Table11.1A, while the remaining tables have information about demographics.




As a reminder here are all of the steps that we performed to wrangle `table11.1a`:



```{r, eval = FALSE}
table11.1a %<>%
  dplyr::as_tibble() %>%
  slice(1:(n()-1))%>%
  dplyr::na_if("nc") %>%
  dplyr::na_if("--") %>%
  dplyr::na_if("") %>%
  dplyr::na_if("*")%>%
  dplyr::rename(MHS_setting = 
                  `Setting Where Mental Health ServiceWas Received`) %>%
  mutate(MHS_setting = 
         str_remove_all(string = MHS_setting, 
                       pattern = "[:digit:]|\r\n|[:punct:]|")) %>%
  mutate(MHS_setting = 
         str_replace_all(string = MHS_setting,
                        pattern = "[:blank:]{1,}", 
                    replacement = " ")) %>%
  mutate(dplyr::across(.cols = -MHS_setting,
                stringr::str_remove_all, "a|,")) %>%
  mutate(across(-MHS_setting, as.numeric)) %>%
  mutate(type = c(rep("Specialty", 9), rep("Nonspecialty", 11))) %>%
  mutate(subtype =c("Specialty_total", 
                    rep("Outpatient", 5), 
                    rep("Inpatient", 3), 
                    "Nonspecialty_total", 
                    rep("Education", 3), 
                    rep("General_medicine", 2),
                    rep("Juvenile_Justice", 2),
                    rep("Child_Welfare", 2), 
                    "combination")) %>%
  mutate(short_label = c("Specialty total", "Outpatient total", "Therapist", "Clinic", "Day program", "In-home Therapist", "Inpatient total", "Hospital", "Residential Center", "Nonspecialty total", "School total", "School Therapist", "School Program", "General Medicine", "Family Dr", "Justice System", "Justice System", "Welfare", "Fostercare", "Specialty Combination")) %>%
  dplyr::filter(MHS_setting != "General_Medicine") %>%
  dplyr::filter(MHS_setting != "Juvenile_Justice") %>%
  dplyr::filter(MHS_setting != "Child_Welfare") %>%
    tidyr::pivot_longer(cols = contains("20"), 
                  names_to = "Year",
                 values_to = "Number") %>%
   mutate(Year = as.numeric(Year))
```

Now we want to wrangle table11.1B which is formatted the most similarly. To do so we can simply run these steps on the using the `table11.1B` as the input. For the sake of education however, we will show you how you could make a function if we had several more similar tables to wrangle. This will also make it easier to write a function to wrangle the other demographic tables.

Last time we wrote a function in this case study, we only had one input in our function. This time we will have several inputs. We will have the table that we want to wrangle as `TABLE`, a new name for the first column called `new_col`, and an input called `pivot_col` which will be the name of the column that will be created after pivoting that will take the values from each of the years.

We will also add code to remove all rows that have only NA values, this means we don't need to know what rows ahead of time.

To do this we will use the `filter()` and `select()` functions of the `dplyr` package. 

We will calculate a sum of the count of `NA` values across the rows for the numeric columns (the columns for each year) using the base `rowSums()` function.

To do this we first select the columns that are numeric using: `select(., is.numeric)`, where the `.` refers to the table after all the previous wrangling steps in our function.

Then we get a true or false statement about which columns have `NA` values with the base `is.na()` function (this requires numeric values).


Then we calculate the sum using the base `rowSums()` function.


Altogether this looks like this: `rowSums(is.na(select(., is.numeric)))`.
Finally we compare this to the number of columns that are numeric by using: `length(select(., is.numeric)))`, with the idea that if the number of `NA` values is less than the number of columns that could have `NA` values, then we know it is not an empty row.
              
Note that if we were using the `summarise()` or `mutate()` function or the `dplyr` package, then we could use the `across()` function of the `dplyr` package to select what columns we wanted to use in our calculation.


```{r}
data_prep_settings <- function(TABLE, new_col, pivot_col){
  dplyr::as_tibble(TABLE) %>%
  slice(1:(n()-1))%>%
  na_if("nc") %>%
  na_if("--") %>%
  na_if("") %>%
  na_if("*") %>%
    rename({{new_col}} := names(.)[1]) %>%
     mutate({{new_col}} := 
         str_remove_all(string = pull(., {{new_col}}), 
                       pattern = "[:digit:]|\r\n|[:punct:]|")) %>%
  mutate({{new_col}} := 
         str_replace_all(string =pull(., {{new_col}}),
                        pattern = "[:blank:]{1,}", 
                    replacement = " ")) %>%
  mutate(dplyr::across(.cols = -{{new_col}},
                stringr::str_remove_all, "a|,")) %>%
     mutate(across(-{{new_col}}, as.numeric)) %>%
    mutate(type = c(rep("Specialty", 9), rep("Nonspecialty", 11))) %>%
     mutate(subtype =c("Specialty_total", 
                    rep("Outpatient", 5), 
                    rep("Inpatient", 3), 
                    "Nonspecialty_total", 
                    rep("Education", 3), 
                    rep("General_medicine", 2),
                    rep("Juvenile_Justice", 2),
                    rep("Child_Welfare", 2), 
                    "combination")) %>%
      mutate(short_label = c("Specialty total", "Outpatient total", "Therapist", "Clinic", "Day program", "In-home Therapist", "Inpatient total", "Hospital", "Residential Center", "Nonspecialty total", "School total", "School Therapist", "School Program", "General Medicine", "Family Dr", "Justice System", "Justice System", "Welfare", "Fostercare", "Specialty Combination")) %>%
     filter(rowSums(is.na(select(., is.numeric))) <
              length(select(., is.numeric))) %>%
  pivot_longer(cols = contains("20"), 
               names_to = "Year", 
               values_to = pivot_col)%>%
     mutate(Year = as.numeric(Year))
}
```

Now we can apply the function to `table11.1b`.

### **Table11.1b**

```{r}
table11.1b<- data_prep_settings(
                    TABLE = table11.1b,
                  new_col = "MHS_setting",
                pivot_col = "Percent")

table11.1b
```

Great!

What about the subsequent tables?

### **Demographic Tables**

All of the rest of the tables have demographic information and have this general structure:

```{r}
knitr::include_graphics(here::here("img", "dem_table.png"))
```

In these tables we have age groups in our first column so we don't want to remove digits or punctuation marks anymore so we need to modify our function a bit to remove that step. 

We also want to add the word `Age` and an underscore in front of the age group listed in the tables. We can use the `str_replace()` function of the `stringr` package, because now we want to only replace the first instance of `1` with `Age_1`.

We also plan to replace the first column name with `Demographic` for all of the tables.

We also want to create a new variable that list the subgroups.

We will also want to only wrangle the data up to the point that we change the shape of the data, so that we can check how the data looks first.


```{r}
data_dem_settings <- function(TABLE){
  dplyr::as_tibble(TABLE) %>%
  slice(1:(n()-1))%>%
  na_if("nc") %>%
  na_if("--") %>%
  na_if("") %>%
  na_if("*") %>%
    rename(Demographic := names(.)[1]) %>%
  mutate(Demographic := 
         str_replace_all(string =pull(., Demographic),
                        pattern = "[:blank:]{1,}", 
                    replacement = " ")) %>%
  mutate(Demographic = str_replace(string = Demographic, 
                                  pattern = "1", 
                              replacement = "Age_1")) %>%
 mutate(subgroup =c("Total",
                    rep("Age", 4), 
                    rep("Sex", 3), 
                    rep("Race/Ethnicity", 9)))%>%
  mutate(dplyr::across(.cols = contains("20"),
                stringr::str_remove_all, "a|,")) %>%
     mutate(across(contains("20"), as.numeric)) %>%
     filter(rowSums(is.na(select(., is.numeric))) <
              length(select(., is.numeric)))}
```

from Michael

```{r, eval=FALSE, echo=FALSE}
data_prep_dem <- function(TABLE, old_col, new_col, pivot_col){
  TABLE <- TABLE[-dim(TABLE)[1],]
  TABLE <- TABLE %>%
  na_if("nc") %>%
  na_if("--") %>%
  na_if("") %>%
  na_if("*")
  TABLE <- TABLE %>%
    as_tibble() %>%
    rename({{new_col}} := {{old_col}})
  partA <- TABLE %>%
    dplyr::select({{new_col}})
  partB <- TABLE %>%
    dplyr::select(-{{new_col}})
  partA <- partA %>%
  mutate({{new_col}} := partA %>%
           dplyr::select({{new_col}}) %>%
           pull({{new_col}}) %>%
           gsub("[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                        "", .))
  partA <- partA %>%
  mutate({{new_col}} := dplyr::case_when(stringr::str_detect(!!base::as.name(new_col), pattern = "1") ~ base::paste("Age",
                                                        stringr::str_sub(!!base::as.name(new_col),
                                                                start = 1,
                                                                end =2),
                                                        stringr::str_sub(!!base::as.name(new_col),
                                                                start = 3,
                                                                end = 4),
                                                        sep="_"),
                                 TRUE ~ !!base::as.name(new_col)))
  partB <- partB %>%
    mutate(across(.cols = everything(),
                str_remove_all, "a")) %>%
    mutate(across(.cols = everything(),
                str_remove_all, ","))
  rm(TABLE)
  TABLE <- bind_cols(partA,
                     partB)
  TABLE <- TABLE %>%
  pivot_longer(cols = contains("20"), names_to = "Year", values_to = pivot_col)
  TABLE
}
```

We use the function to wrangle the next set of tables. We will also add a column to describe what the data is about which will be helpful for merging the data later.
```{r}
table11.2a <- data_dem_settings(TABLE = table11.2a)
table11.2a %<>% mutate(data_type = "Major_Depressive_Episode")
glimpse(table11.2a)

table11.2b <- data_dem_settings(TABLE = table11.2b)
table11.2b %<>% mutate(data_type = "Major_Depressive_Episode")
glimpse(table11.2b)

table11.3a <- data_dem_settings(TABLE = table11.3a)
table11.3a %<>% mutate(data_type = "Severe_Major_Depressive_Episode")
glimpse(table11.3a)

table11.3b <- data_dem_settings(TABLE = table11.3b)
table11.3b %<>% mutate(data_type = "Severe_Major_Depressive_Episode")
glimpse(table11.3b)

table11.4a <- data_dem_settings(TABLE = table11.4a)
table11.4a %<>% mutate(data_type = "treatment")
glimpse(table11.4a)

table11.4b <- data_dem_settings(TABLE = table11.4b)
table11.4b %<>% mutate(data_type = "treatment")
glimpse(table11.4b)

```

Great! All of the demographic tables look good.

Now let's combine the count data (the "a" tables) and the percent data (the "b" tables) using the `bind_rows()` function of the `dplyr` package. Which will append each of the subsequent tables together.

We can use the `distinct()` function of the `dplyr` package to check that we indeed have all the data types now in these larger tibbles.

```{r}
counts <-bind_rows(table11.2a, table11.3a, table11.4a)
percents <-bind_rows(table11.2b, table11.3b, table11.4b)

counts %>%distinct(data_type)
percents %>%distinct(data_type)
```
Great!

Now we will reformat both the `counts` and `percents` data to be in the long format using `pivot_longer()` once again.

```{r}
counts %<>%
  pivot_longer(cols = contains("20"), 
               names_to = "Year", 
               values_to = "Number")%>%
     mutate(Year = as.numeric(Year))

percents %<>%
  pivot_longer(cols = contains("20"), 
               names_to = "Year", 
               values_to = "Percent")%>%
     mutate(Year = as.numeric(Year))
```

Now that we've wrangled the data, we can go ahead and proceed with our analysis. 

```{r, eval = FALSE, echo = FALSE}
save(percents, counts, table11.1a, table11.1b, file = here::here("data", "wrangled_data.rda"))
```

## **Data Analysis**
*** 

```{r, echo = FALSE}
load(file = here::here("data", "wrangled_data.rda"))
```
Recall what our main questions were:

#### {.main_question_block}
<b><u> Our main questions: </u></b>

1) How have depression rates in American youth changed since 2004, according to the NSDUH data?  How have rates differed between different youth subgroups (age, gender, ethnicity)?
2) Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?


####

We are specifically interested in how the frequency of recent major depressive episodes among youths differ compared to those in 2004. We are also interested how different groups differ. We will start with examining how male and females differ across time.

Since we have counts for the two sexes: males and females, and for the two years of interest, 2004 and 2018, we can conduct a [Pearson's chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#:~:text=Pearson's%20chi%2Dsquared%20test%20is,differs%20from%20a%20theoretical%20distribution.){target="_blank"} for independence, which is also written like this: ${\chi}^2$. 

This will allow us to compare if the frequencies of major depressive episodes differs from what we would expect by chance if the variables years and sex were independent.

The null hypothesis is that the variables are independent or that the  difference in the proportions is equal to zero. Thus we want to test if sex and year are independent. In other words, we want to know if sex influences the number of major depressive episodes in 2004 and 2018. Or stated yet another way, Are major depressive episode counts across the two years associated with sex.  



According to [wikipedia](https://en.wikipedia.org/wiki/Chi-squared_test?oldformat=true){target="_blank"}: 

> Pearson's chi-square test is used to determine whether there is a statistically significant difference between the **expected frequencies** and the **observed frequencies** in one or more **categories** of a [contingency table](https://en.wikipedia.org/wiki/Contingency_table){target="_blank"}.


Thus, to conduct this statistical test, we need to produce a [contingency table](https://en.wikipedia.org/wiki/Contingency_table){target="_blank"}, which in this case could also be called a 2x2 table, which is the simplest kind of contingency table (only two levels for two variables).

Here is an example of another 2x2 table:


```{r, echo = FALSE}
contig <-tibble(Sex = c("Male", "Female", "Total"), Righthanded = c(41,47,88), Lefthanded = c(9,3, 12), Total = c(50,50,100))

contig
```

Here are all the observed values.

We can see that there are two variables: Sex and Handedness and each have two levels (male and female for sex and right and left for handedness). The table also includes totals of each of the 4 possible groups as well as the overall total.


Now let's create a table of expected values assuming sex and handedness are independent. 

The total right-handed is 88 out of 100 which is equal to 88%.
The total left-handed is 12  out of 100 which is equal to 12%.

Thus if each sex had the same amount of right-handedness and left-handedness, we would expect 88% of the males to be right-handed and 12 % to be left-handed; and
we would expect the exact same proportions of 88% right-handed and 12% left-handed for the females.

Since we have 50 males and 50 females and 12% of 50 is 6 and 88% of 50 is 44, we would thus have the following table of expected frequencies:

```{r, echo = FALSE}
contig <-tibble(Sex = c("Male", "Female", "Total"), Righthanded = c(44,44,88), Lefthanded = c(6,6, 12), Total = c(50,50,100))

contig

```


The ${\chi}^2$  is calculated like so:

$${\chi}^2=\sum_{j=1}^{m} \frac{(O_j - E_j)^2}{E_j}$$

Where $O_j = j^{th}$ observed count and $E_j = j^{th}$ expected count for
the jth cell of a contingency table with $m$ cells.

The degrees of freedom ($df$) is calculated like so: $df= (r-1)(c-1)$.

Where $r$ = # of rows and $c$ = the # of columns.


So to calculate the ${\chi}^2$ for handedness and sex manually, we can do the following for each of the four values in the table (besides the totals) like this:

${\chi}^2$ = fraction of the square difference of the observed minus the expected, divided by expected for right-handed males + the same for left-handed males + the same for right-handed females + the same for left-handed females

This is equal to :

$${\chi}^2 =  \frac{(41-44)^2}{44} + \frac{(9-6)^2}{6}+ \frac{(47-44)^2}{44}+ \frac{(3-6)^2}{6}$$

Which is equal to:

${\chi}^2$ = `r ((41-44)^2)/44 +((9-6)^2)/6 + ((47-44)^2)/44 + (3-6)^2/6`

Where the degrees of freedom  = $df = 92-1)(2-1) = 1$


What does this mean? We need to consult a chi-square distribution to determine what the p-value is and if this is significant.

```{r}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/8/8e/Chi-square_distributionCDF-English.png")
```
#### [[source](https://en.wikipedia.org/wiki/Chi-squared_test#/media/File:Chi-square_distributionCDF-English.png){target="_blank"}

This [website](http://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html){target="_blank"} has a simple way to check this, without requiring you to get out a ruler and consult this plot. Note on this website the notation for $df$ is ${\nu}$.

Plugging in 3.409091 as ${\chi}^2$ and 1 as ${\nu}$, we get a $p-value$ of 0.06484.

Thus we do not have enough evidence to reject the null hypothesis.  Therefore, we conclude that sex and handedness do appear to be independent.

See here for a more thorough explanation of the [chi-square test](https://www.ling.upenn.edu/~clight/chisquared.htm){target="_blank"}.



Now let's create a contingency table with our own data and see how we can implement this test in R.

It is critical that we use the counts data, and not the percentage data for our analysis, as the ${\chi}^2$ requires counts. 


We will filter the count data for the `Major_Depressive_Episode` data, as well as for the `Male` and `Female` data from `2004` and `2018`.


The following code subsets the data we need and makes the necessary manipulations so that the units of observation are appropriate. If we take a look at the title of the table we can see that the numbers in the table represent individuals by the thousands.

```{r}
knitr::include_graphics(here::here("img", "dem_table.png"))
```

```{r}
chi_square_11.2a <- counts %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(Year %in% c(2004, 2018)) %>%
  filter(Demographic %in% c("Male","Female")) %>%
  mutate(Number = Number * 1000)
```

The resulting object contains all the values we need for out contingency table. 

```{r}
chi_square_11.2a
```

 

A contingency table can  now be produced from this data which is in long format by transforming the data to wide format and re-purposing some values as row names. To reformat the data to wide format we can use the `pivot_wider()` function of the `tidyr` package.

For this function there are several important arguments:  
1) names_from - this is the variable where the names of new columns will come from
2) values_from - this is the variable where the values for the new columns will come from
3) names_prefix - if we want to add a prefix to the new columns we can do so using this argument

So in our case we want to spread out the year data into two columns thus the names will come from the `Year` variable and the values will come from the `Number` variable. We want to add the word `Year` to the new columns. We also want the remove the `subgroup` and `data_type` variables and only keep the `Demographic`, `Year`, and `Number` variables. To do so we can use the `select()` function. Fin

```{r}
chi_square_11.2a %<>%
  select(Demographic, Year, Number) %>%
  tidyr::pivot_wider(
                     names_from = Year,
              names_prefix = "Year", 
              values_from = Number)
chi_square_11.2a

```

Now we can use the `column_to_rownames()` function of the `tibble` package to make the `Demographic` variable levels the row names. Otherwise we would need to select the numeric columns to perform the stats test.

```{r}
chi_square_11.2a %<>%
  tibble::column_to_rownames("Demographic")

chi_square_11.2a
```

Note that a contingency table would also have totals for all groups as well, but this is not necessary for the `stats::chisq.test()` function.


The chi-squared test for independence can be conducted using the `stats::chisq.test()` function. 

```{r}
stats::chisq.test(chi_square_11.2a)
```

We see that the p-value is very small, which suggests that sex does influence the count of major depressive episodes across time. 

How about for severe major depressive episodes?


```{r}
chi_square_11.3a <- counts %>%
  filter(data_type == "Severe_Major_Depressive_Episode")%>%
  filter(Year %in% c(2009, 2018)) %>%
  filter(Demographic %in% c("Male","Female")) %>%
  mutate(Number = Number * 1000)

chi_square_11.3a <- chi_square_11.3a %>%
  select(-data_type) %>%
  select(-subgroup) %>%
  pivot_wider(names_from = Year,
              names_prefix = "Year", 
              values_from = Number) %>%
  column_to_rownames("Demographic")

chi_square_11.3a
```

```{r}
chisq.test(chi_square_11.3a)
```
There also appears to be an influence of sex on the rate of severe major depressive episodes across the years.


How about treatment?


```{r}
chi_square_11.4a <- counts %>%
  filter(data_type == "treatment")%>%
  filter(Year %in% c(2009, 2018)) %>%
  filter(Demographic %in% c("Male","Female")) %>%
  mutate(Number = Number * 1000)

chi_square_11.4a <- chi_square_11.4a %>%
  select(-data_type) %>%
  select(-subgroup) %>%
  pivot_wider(names_from = Year,
              names_prefix = "Year", 
              values_from = Number) %>%
  column_to_rownames("Demographic")

chi_square_11.4a
```

```{r}
chisq.test(chi_square_11.4a)
```
There also appears to be an influence of sex on the rate at which youths received treatment across the two years.


avocado.. I think it makes sense to move the visualization section first... and use it to motivate that analysis from the year 2010 and of male and females...

## **Data Visualization**
*** 

```{r, echo = FALSE}
load(file = here::here("data", "wrangled_data.rda"))
```

OK, so now we are going to make some visualizations to further explore our questions of interest.

We are going to use the `ggplot2` package to create our plots.

<details><summary> Click here for an introduction about this package if you are  new to using `ggplot2` </summary>

The [ggplot2 package](http://ggplot2.tidyverse.org) is generally intuitive for beginners because it is based on a  [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html) or the `gg` in `ggplot2`. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.

The critical part to making graphics using `ggplot2` is the data needs to be in a _tidy_ format. 
Given that we have just spent time putting our data in _tidy_ format, we are primed to take advantage of all that `ggplot2` has to offer! 

We will show how it is easy to pipe _tidy_ data (output) as input to other functions that create plots. 
This all works because we are working 
within the _tidyverse_. 


**What is the `ggplot()` function?** 
As explained by Hadley Wickham:

> The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.

 `ggplot2` Terminology 
* **ggplot** - the main function where you specify the dataset and variables to plot (this is where we define the `x` and
`y` variable names)
* **geoms** - geometric objects
    * e.g. `geom_point()`, `geom_bar()`, `geom_line()`, `geom_histogram()`
* **aes** - aesthetics
    * shape, transparency, color, fill, line types
* **scales** - define how your data will be plotted
    * continuous, discrete, log, etc

The function `aes()` is an aesthetic mapping function inside the `ggplot()` object. 
We use this function to specify plot attributes (e.g. `x` and `y` variable names) that will not change as we add more layers.  

Anything that goes in the `ggplot()` object becomes a global setting. 
From there, we use the `geom` objects to add more layers to the base `ggplot()` object. 
These will define what we are interested in illustrating using the data.

</details>

### Major Depressive Episodes in past year

We will start by taking a look at the rate of major depressive episodes among youths over time of the various demographic groups.

OK, we will start out by using the `ggplot()` function to specify what data we would like to plot on each axis. We will also indicate that would like to use the `Demographic` variable to group our data and to color our data.  This is our first layer of the plot, thus for subsequent layers we need to use a plus sign `+`. 

Next we will use the `geom_line()` function of the `ggplot2` package to specify that we would like to create a line plot. 

Then we will add labels for the title and subtitle using the `labs()` function of the `ggplot2` package. 

Finally we will move our legend to the bottom of the plot using the `theme()` function which helps us control various details about our plot. 

```{r}
percents %>%
    filter(data_type == "Major_Depressive_Episode")%>%
ggplot2::ggplot(aes(x = Year, 
                    y = Percent, 
                group = Demographic, 
                color = Demographic))+
  geom_line(size = 1) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  theme(legend.position = "bottom")
```
 This plot very difficult to read because there are so very many groups. 
Now let's look at just the total across time. We can do so by first filtering our data for total values. 

It would also nice to include every year in the x-axis. We can do so by using the `scale_x_continuous()` function which gives us greater control about how the x-axis is displayed.

Finally, we will drop the legend since we will only have one group using `legend.position = "none"` and we can change the angle of the x-axis text using `axis.text.x = element_text(angle = 90)` within the `theme()` function.

We will also make the line thicker using the `size =` argument for the `geom_line()` function.

The `theme_linedraw()` function changes the aesthetics of the plot. See [here](https://ggplot2.tidyverse.org/reference/ggtheme.html){target="_blank"} for a list of options.

```{r}
MDE_total <-percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(Demographic == "TOTAL") %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         group = Demographic, 
         color = Demographic)) +
  geom_line(aes(color = Demographic), size = 1.5) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Percent of Persons Aged 12 to 17 Reporting Having a \n Major Depressive Episode in the Past Year ")+
  theme_linedraw()+
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "none") 

MDE_total
```

We can see that there is a steep increase after around 2011:

Let's add a different background color for the years since 2011.
We can do so by adding a `geom_rect()` layer before we plot the line.
We just need to specify the location of the rectangle on our plot.

```{r}
MDE_total <-percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(Demographic == "TOTAL") %>%
  ggplot(aes(x = Year, y = Percent, group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,  
            ymin = -Inf, ymax = Inf,  
            fill = "light gray")+
  geom_line(aes(color = Demographic), size = 1.5) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "The Rate of Youths Aged 12 to 17 Reporting Having a \n Major Depressive Episode in the Past Year is Increasing")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "none") 

MDE_total

```

Now let's look at group differences.  

To make sure our plot is not too overwhelming, let's limit to only age and sex subgroups. Thus we will filter out the data about totals and different racial/ethnic groups for now. We will also use the `facet_wrap()` function to make subplots based on the demographic categories, which we put in a variable called `subgroups` earlier when we wrangled the data.

```{r}
MDE_age_sex <-percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(subgroup != "Race/Ethnicity") %>%
  filter(Demographic != "TOTAL") %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         group = Demographic, 
         color = Demographic)) +
  geom_line(aes(color = Demographic), size =1) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  facet_wrap(~subgroup)+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) 

MDE_age_sex
```

Nice! Now it is much easier to tell how each group has changed over time.

We can also add labels directly to the lines using the `directlabels` package. There are several methods to do so. See [here](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} for more information about the options for adding labels with this package.  We will use the `"far.from.others.borders"` method so that our labels don't overlap one another. We will also use `dl.trans()` of the `directlabels` package to move the labels slightly upward (`y = y +0.35`) and to the left (`x = x -0.1`). We will use the `dl.move()` function of the `directlabels` package to move one of the labels to a particular location. 


```{r}
directlabels::direct.label(MDE_age_sex, list(
  dl.trans(y = y +0.38, x = x -0.1), 
  "far.from.others.borders",
  cex = .8, 
  dl.move("Age_14-15", x = 2008, y =10)))

```

This looks very clear now!

We can see that the majority of individuals that reported experiencing a major depressive episode in the past year were in an older age bracket. We can also see that the trend has been increasing for all three age brackets since roughly 2011.

We can also see an increase for both sexes since about 2011, but there is a steeper increase for females. Furthermore Females have a much higher rate than males for all years.

Let's make the same plot with a different shaded background for the years of the increase like we did for the total plot.

```{r}
MDE_age_sex <-percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(subgroup != "Race/Ethnicity") %>%
  filter(Demographic != "TOTAL") %>%
  ggplot(aes(x = Year, y = Percent, group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,  
            ymin = -Inf, ymax = Inf,  
            fill = "light gray")+
  geom_line(aes(color = Demographic), size =1) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  facet_wrap(~subgroup)+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) 

MDE_age_sex <-direct.label(MDE_age_sex, list(
  dl.trans(y = y +0.38, x = x -0.2), 
  "far.from.others.borders", 
  cex = .8, 
  dl.move("Age_14-15", x = 2008, y =10)))
MDE_age_sex
```

So what might be accounting for this?


This cross-cultural [review article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330161/){target="_blank"} published in 2012 suggests that aspects related to life-style due to modernity may be causing increased depression rates: 

> Modern populations are increasingly overfed, malnourished, sedentary, sunlight-deficient, sleep-deprived, and socially-isolated. These changes in lifestyle each contribute to poor physical health and affect the incidence and treatment of depression.

And although this may be true globally, the US has been arguably experiencing these modern lifestyle changes for years prior to this steep increase in 2011.

So what might have happened in the US around this time?

There is large amount of literature about how the use of smart phone and social media may have lead to increased depression rates. See this [article](https://childmind.org/article/is-social-media-use-causing-depression/){target="_blank"} which links to many scientific articles on the subject. 

This has been a controversial topic due to conflicting findings, likely due to focus on different sites and aspects of social media across different studies. 

The article points out that the true relationship between social media use and depression may be nuanced. Some individuals who have high face-to-face levels of interaction or lack of the opportunity to interact with others face-to-face (due to various barriers like geography), may actually experience lower risk for depression with more social media use. 

Furthermore, different social media platforms may vary for their influence on depression.

Instagram was released in 2010 (which is around when our plot starts to show the upward increase in major depressive episodes, particularly in females) and according to the article:

> Image-driven Instagram shows up in surveys as the platform that most leads young people to report feeling anxiety, depression and worries about body image.

Furthermore, it may be secondary effects of social media use, like less physical activity or less sleep that may increase the risk for depression. 

Now let's take a look at how the rate of major depressive episodes has changed across time for different racial/ethnic groups.

Since we have so many groups, we probably don't want to directly label the lines this time. Instead, we will rely on the legend that will be automatically created.

We will use the the `fct_reorder` function of the `forcats` package to order the racial/ethnic groups in the legend based on the last value (using `tail()`) of the `Percent` variable.

We will also manually color our lines based on a color palette called [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html){target="_blank"}, which is more discernible for individuals with color-blindness. To do so we will use the `scale_color_viridis_d()` function of the `ggplot2` package, which is intended for coloring discrete values.

```{r}

MDE_race <-percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(subgroup == "Race/Ethnicity") %>%
     mutate(Demographic = fct_reorder(Demographic, 
                                      Percent, 
                                      tail, 
                                      n = 1, 
                                      .desc = TRUE)) %>%
    ggplot(aes(x = Year, 
               y = Percent,
           group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,   
            ymin = -Inf, ymax = Inf,   
            fill = "light gray")+
  geom_line(aes(color = Demographic), size =1)+
  facet_wrap(~subgroup)+
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  scale_color_viridis_d()+
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) 

MDE_race

```

Unfortunately, there is only one value for `NHOPI` group, thus since this is a line plot, we do not have enough points(2 at minimum) to create a line, so lets remove this group from the plot to remove the group from the legend.

```{r}
percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(Demographic =="NHOPI")

MDE_race <-  percents %>%
  filter(data_type == "Major_Depressive_Episode")%>%
  filter(subgroup == "Race/Ethnicity") %>%
    filter(Demographic != "NHOPI") %>%
     mutate(Demographic = fct_reorder(Demographic,
                                      Percent, 
                                      tail, 
                                      n = 1, 
                                      .desc = TRUE)) %>%
   ggplot(aes(x = Year, 
              y = Percent, 
          group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,  
            ymin = -Inf, ymax = Inf,   
            fill = "light gray")+
  geom_line(aes(color = Demographic), size =1)+
  facet_wrap(~subgroup)+
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) +
  scale_color_viridis_d()

MDE_race

```


Census Bureau definitions:  
1) **AIAN** stands for American Indian and Alaska Native  
2) **NHOPI** stands for Native Hawaiian or Other Pacific Islander  

We can see that the group of individuals who reported as being two or more races, had the highest percentages of having a major depressive episode in the past year. Those group who reported as Black or African American had the lowest percentages. However, we can see that most of the racial/ethnic groups are fairly similar and we see an increasing for most groups since around 2011. Keep in mind the limitations listed in the [**Limitations**] section as you view these findings. It is possible that this group may be less likely to report experiencing symptoms of depression.

### Major Depressive Episode with Severe Impairment

Now let's take a look at how the rate of youths reporting having a major depressive episode with severe impairment has changed over time. See the [**What are the data?**] section about how severe impairment was defined. 

```{r}
MDES <-percents %>%
  filter(data_type == "Severe_Major_Depressive_Episode")%>%
  filter(subgroup != "Race/Ethnicity") %>%
  filter(Demographic != "TOTAL") %>%
    ggplot(aes(x = Year, 
               y = Percent,
           group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,  
            ymin = -Inf, ymax = Inf,  
            fill = "light gray")+
  geom_line(aes(color = Demographic), size =1)+
  scale_x_continuous(breaks = seq(2006, 2018, by=1),
                     labels = seq(2006, 2018, by=1),
                     limits = c(2006, 2018)) +
  labs(title = "Major Depressive Episode with Severe Impairment in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  facet_wrap(~subgroup)+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) 

direct.label(MDES, list(
  dl.trans(y = y +0.39, x = x -0.1), 
  "far.from.others.borders",
  cex = .8, 
  dl.move("Age_14-15", x= 2016.5, y = 11)))


```


We can see that the majority of individuals that reported experiencing a major depressive episode with severe impairment were in an older age bracket, however there appears to be a more dramatic change in the middle age group from 2011-2012. We can see a very steep increase in the data for the females after 2011, again this is much more steep than the increase seen for males over time.

Now let's look at racial/ethnic groups.

```{r}
Race_MDES <-  percents %>%
  filter(data_type == "Severe_Major_Depressive_Episode")%>%
  filter(subgroup == "Race/Ethnicity") %>%
    mutate(Demographic = fct_reorder(Demographic, 
                                     Percent, 
                                     tail,
                                     n = 1, 
                                     .desc = TRUE)) %>%
    ggplot(aes(x = Year,
               y = Percent,
           group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,  
            ymin = -Inf, ymax = Inf,  
            fill = "light gray")+
  geom_line(aes(color = Demographic), size =1)+
  facet_wrap(~subgroup)+
  scale_x_continuous(breaks = seq(2006, 2018, by=1),
                     labels = seq(2006, 2018, by=1),
                     limits = c(2006, 2018)) +
  labs(title = "Major Depressive Episode with Severe Impairment in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics: Percentages, 2006-2018")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) +
  scale_color_viridis_d()


Race_MDES

```

Census Bureau definitions:  
1) **AIAN** stands for American Indian and Alaska Native  
2) **NHOPI** stands for Native Hawaiian or Other Pacific Islander 

We see similar trends as we saw for the previous racial/ethnic group plot. The rate is highest for those who are two are more races and lowest for those who are Black or African American. The data for the AIAN group is sparse, so it is unclear if their levels would be lowest on the last year.

### Receipt of treatment for depression for youths who had a major depressive episode

Now we will take a look at those who received treatment. First let's look overall.

```{r}
Treat_total <-percents %>%
  filter(data_type == "treatment")%>%
  filter(Demographic == "TOTAL") %>%
  ggplot(aes(x = Year, 
             y = Percent, 
             group = Demographic)) +
  geom_rect(xmin = 2011, xmax = Inf,  
            ymin = -Inf, ymax = Inf,   
            fill = "light gray")+
  geom_line(aes(color = Demographic), size = 1.5) +
  #geom_smooth(method = "loess", se = FALSE)+
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "The Rate of Youths Aged 12 to 17 Reporting Having a \n Major Depressive Episode in the Past Year is Increasing")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "none") 

Treat_total

```

Overall roughly 40 percent of youths who self-reported experiencing a major depressive episode, also received treatment for depression. 

This rate is increasing overtime like the trend of those who had a major depressive episode, yet the data is much more variable from one year to the next. 

```{r}
treat<-percents %>%
  filter(data_type == "treatment")%>%
  filter(subgroup != "Race/Ethnicity")%>%
  filter(subgroup != "Total")%>%
  ggplot(aes(x = Year, 
             y = Percent, 
             group = Demographic, 
             color = Demographic)) +
  geom_line( size =1) +
  facet_wrap(~subgroup)+
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) + 
  labs(title = "Receipt of Treatment for Depression in Past Year among\nPersons Aged 12 to 17 with Major Depressive Episode in Past Year",
       subtitle = "By Demographic Characteristics: Percentages, 2004-2018")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) 

treat<-direct.label(treat, list(
  dl.trans(y = y -1.5, x = x -0.4),
  "far.from.others.borders", 
  cex = .8, 
  dl.move("Age_16-17", x = 2010, y = 42),
  dl.move("Age_14-15", x = 2015, y = 38)))

treat
```

There seems to be an upward trend, but it isn't nearly as much as the trend we saw for the increase of major depressive episodes. In general, the data seems to vary much more as well.


```{r}
Race_treat <-percents %>%
  filter(data_type == "treatment")%>%
  filter(subgroup == "Race/Ethnicity") %>%
  mutate(Demographic = fct_reorder(Demographic, 
                                   Percent, 
                                   tail,
                                   n = 1, 
                                   .desc = TRUE)) %>%
  ggplot(aes(x = Year,
             y = Percent, 
         group = Demographic, 
         color = Demographic)) +
  geom_line(size =1) +
  scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) +
  labs(title = "Receipt of Treatment for Depression in Past Year among\nPersons Aged 12 to 17 with Major Depressive Episode",
       subtitle = "By Demographic Characteristics: Percentages, 2004-2018")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 90)) +
    scale_color_viridis_d()

Race_treat

```
It looks as though youths who report as being white received the most care from mental health services.


### Mental Health Services

We will also take a look at where youths are receiving treatment by using values from table11.1b which has the percentage values for counts presented in table11.1a.


We can use the `str_detect()` function of the `stringr` package to filter for the values of the `short_label` variable that has the word `total` in it. 

```{r}
plotMHS <-table11.1b %>%
  filter(stringr::str_detect(short_label, "total") ) %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         group = MHS_setting, 
         color = short_label)) +
  geom_line(size = 1) +
  facet_wrap(~type)+
  ggplot2::scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) +
  ggplot2::labs(title = "Settings Where Mental Health Services Were Received in Past Year\namong Persons Aged 12 to 17",
       subtitle = "Percentages, 2002-2018")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
         axis.text.x = element_text(angle = 90)) 

plotMHS <-direct.label(plotMHS, list(
  dl.trans(y = y +0.35, x = x -0.1),
  "far.from.others.borders", 
  cex = .8, 
  dl.move("Outpatient total", x = 2015, y =11)))

plotMHS
```

We can see that youths appear to be receiving care in nonspecialty facilities at a slightly higher rate than that of specialty facilities. However, the rates appear to be very similar and the relative differences appear to be consistent across time.

Let's take a look at subcategories of mental health services.
So now we will filter for values within the `short_label` variable that do not contain the word "total" by using a `!` in front of the `str_dectect` statement.

```{r}
plotMHSS <-table11.1b %>%
  filter(!str_detect(short_label, "total")) %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         group = MHS_setting, 
         color = short_label)) +
  geom_line(size = 1) +
  facet_wrap(~ type)+
  scale_x_continuous(breaks = seq(2002, 2019, by=1),
                     labels = seq(2002, 2019, by=1),
                     limits = c(2002, 2019)) +
  labs(title = "Settings Where Mental Health Services Were Received in Past Year\namong Persons Aged 12 to 17",
       subtitle = "Percentages, 2002-2018")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 14, face = "bold"),
         axis.text.x = element_text(angle = 90)) 

plotMHSS <-direct.label(plotMHSS, list(
  dl.trans(y = y +0.3),
  "far.from.others.borders", 
  dl.move("School Therapist", 2010, 10), 
  dl.move("Fostercare", 2010, 1), 
  dl.move("Therapist", x=2009, y = 10.5)))
plotMHSS
```



OK, so now we know how the rates of different subgroups compare for having a major depressive episode in the past year, having a major depressive episode with severe impairment, and receiving treatment. We also know where youths are typically getting treatment. But how do the rates of having a major depressive episode in the past year, having a major depressive episode with severe impairment, and receiving treatment compare within each group?

### Overall outcomes by group

```{r}
sex_outcomes <-percents %>%
  filter(Demographic %in% c("Male", "Female", "TOTAL")) %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         color = Demographic)) +
  geom_line(aes(linetype= data_type), size = 1) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  facet_wrap(~Demographic, strip.position = "top")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 8, face = "bold"),
       axis.text.x = element_text(angle = 90),
      legend.title = element_blank(),
   legend.position = "bottom") +
  guides(color = FALSE)

sex_outcomes 
```

We can see that a large portion of individuals experiencing a major depressive episode have an episode with severe impairment for each group. Females have a higher rate of both types of episode and of treatment. Although females have more than double the rate of reported episodes, they receive a relatively similar rate of treatment as males for depression. This suggests that females are either more likely than males to self-report depression symptoms in surveys, or females may not be receiving as much care despite the larger need. 

```{r}
age_outcomes <-percents %>%
  filter(subgroup == "Age") %>%
  #filter(Demographic != "NHOPI") %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         color = Demographic)) +
  geom_line(aes(linetype= data_type), size = 1) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  facet_wrap(~Demographic, strip.position = "top")+
  theme_linedraw()+
  theme(strip.text = element_text(size = 8, face = "bold"),
       axis.text.x = element_text(angle = 90),
      legend.title = element_blank(),
   legend.position = "bottom")+
  guides(color = FALSE)

age_outcomes 

```
All age groups show a similar ratio of severe major depressive episodes for those that experienced an episode. 
```{r, fig.height=10}
race_outcomes <-percents %>%
  filter(subgroup == "Race/Ethnicity") %>%
  filter(Demographic != "NHOPI") %>%
  ggplot(aes(x = Year, 
             y = Percent, 
         color = Demographic)) +
  geom_line(aes(linetype= data_type), size=1) +
  scale_x_continuous(breaks = seq(2004, 2018, by=1),
                     labels = seq(2004, 2018, by=1),
                     limits = c(2004, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018") +
  facet_wrap(~Demographic, strip.position = "top", nrow = 4)+
  theme_linedraw()+
  theme(strip.text = element_text(size = 8, face = "bold"),
       axis.text.x = element_text(angle = 90),
      legend.title = element_blank(),
   legend.position = "bottom")+
  guides(color = FALSE)
race_outcomes

```

All racial and ethnic groups also show a similar rate of severe episodes relative to general episode rate. The rate of receiving treatment is fairly similar relative to the percentage of youths that reported having symptoms for each group.

## **Summary**
*** 

### Summary Plot

Let's make a plot that summarizes our major findings.

We will use the `ggdraw()` function of this package. This allows you to add labels and other plot aspects on top of existing plots. Thus if we want to add a title element to our overall plot that we will add to a combined plot of our existing plots we can use `ggdraw()` to start and then the `draw_label()` function to add text.

```{r}
title_plots <- ggdraw() + 
  draw_label(
    "Self-Reported Depression Among American Youths",
    fontface = 'bold',
    size=18,
    x = 0,
    hjust = -0.01
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0.5)
  )
```

We can also make a subtitle in the same way.

```{r}

forward <- ggdraw() + 
  draw_label(
    "The percentage of youths experiencing major depressive episodes and receiving treatment\n for depression has increased especially since 2011",
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = -0.01
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )



```

Now we will modify some of our existing plots using the `theme()` function as we did before to remove the x-axis title, to change the color of the axis text and the title size and color, as well as change the titles of the plots. 

```{r}

MDE_total_for_mp <-MDE_total +
  theme(plot.title = element_text(size = 14, 
                                 color = "black"),
     plot.subtitle = element_blank(),
      axis.title.x = element_blank(),
         axis.text = element_text(color = "black")) +
  labs(title = "The percent of youths who reported having a\nmajor depressive episode") 

treat_for_mp <- Treat_total +
  theme(plot.title = element_text(size = 14, 
                                 color = "black"),
     plot.subtitle = element_blank(),
      axis.title.x = element_blank(),
         axis.text = element_text(color = "black")) +
  labs(title = "The percent of youths who reported receiving\ntreatment for depression", subtitle = "", x = "")


MDE_age_sex_for_mp <- MDE_age_sex +
  theme(plot.title= element_text(size = 14, 
                                color = "black"),
        plot.subtitle = element_blank(),
        axis.title.x = element_blank()) +
  labs(title = "Females and older youths have highest rates and show\nthe steepest increase") 

```

For this last plot we also want to get the legend and save it as a separate object so that we can add it to our plot grid in a way that doesn't shrink the size of our plot to accommodate the legend. We can use the `get_legend()` function of the `cowplot` package to do this. However, beforehand, we also want to change the way the legend is displayed. We can use the `guides()` function of the `ggplot2` package to modify the legend to specify that we want the legend to be displayed in 2 rows like so `guides(guide_legend(nrow = 2))`.
```{r}
MDE_race_for_mp_leg <- MDE_race +
  theme(plot.title= element_text(size = 14, 
                                color = "black"),
        plot.subtitle = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "bottom", 
        legend.title = element_blank(),
        legend.text = element_text(size = 14)) +
  labs(title = "All racial/ethnic groups show similar\nincreases since 2011") +
  guides(guide_legend(nrow = 2))

legend <-get_legend(MDE_race_for_mp_leg)

```

Now we will remove the legend for this plot:
```{r}
MDE_race_for_mp <-MDE_race_for_mp_leg +
  theme(legend.position = "none")

```

OK, now we are ready to start putting our plots together using the `plot_grid()` function of the `cowplot` package. 

It is helpful to first make rows by combining the plots that we want to be displayed next to each other and then combining these with the title and subtitle, called `forward`. 

We can use the `rel_widths` argument to modify how wide each plot is displayed.

```{r}
row_1 <- plot_grid(MDE_total_for_mp,
                   treat_for_mp,
                   nrow = 1)
row_2 <- plot_grid(MDE_age_sex_for_mp,
                   MDE_race_for_mp,
                   nrow = 1, rel_widths = c(1.2, 0.8))

```


Now we can combine everything together using `plot_grid()` yet again. Now that we have rows, we can combine everything as a single column and easily modify the relative heights using the `rel_heights()` function so that our title, subtitle and legend are all relative short relative to the plots. We will make the first row half the height of the second row. 

Finally, we will use the `png()` function of the `grDevices` package which is automatically loaded in RStudio sessions to save the plot. We will use the `here()` function of the `here` package, to specify that we want to save it in the `img` directory and call it `mainplot.png`. We can also use this function to specify the resolution with `res` and in doing so, we need to save the image with size specifications to make it larger. 

```{r, eval = FALSE}
png(filename = here::here("img", "mainplot.png"), 
         res = 300, width = 10, height = 10, units = "in")
plot_grid(title_plots, 
          forward,
          row_1,
          row_2,
          legend,
          ncol = 1, 
          rel_heights = c(0.1,0.1,.5, 1, 0.2))

dev.off()
```

The `dev.off()` function is necessary to close the graphics device. This is good practice to allow you to create another plot again later.

```{r,echo = FALSE}
knitr::include_graphics(here::here("img", "mainplot.png"))
```

avocado: I want to add something about self-reporting bias to the  objectives

### Synopsis

In this case study we evaluated self-reported measures of depression symptoms among youths age 12-17 in the United States, as well as the rate of youths receiving treatment for depression. We learned how to scrape data directly from the web using the `rvest` package and we learned how to perform and interpret a chi-square test using the `chisq.test()` function of the `stats` package.

By analyzing and plotting our data, it is clear that depression rates appear to be increasing, particularly since 2011. However, it is possible that respondents had similar rates in previous years, but now feel less stigma about responding about depression symptoms when filling out the survey. The survey has always been anonymous, but [reporting bias](https://en.wikipedia.org/wiki/Self-report_study#:~:text=Self%2Dreport%20studies%20are%20inherently,answers%20will%20be%20more%20positive.){target="_blank"} can sometimes cause individuals to exaggerate or minimize their symptoms because of what they think researchers want their response to be or out of shame or embarrassment, among other reasons. However, the data suggests that youths may be experiencing more symptoms of depression and that the rate of increase is quite high. Now nearly a quarter of all individuals that were female and of age 12-17 reported experiencing symptoms of depression. This warrants further investigation to see if this is a product of more reporting or if indeed American females are truly more depressed. Furthermore,if they are indeed more depressed, investigation about why young females are more depressed is also of critical importance. One important limitation is that the data does not include subgroup intersections, such as rates of major depressive episodes among females of various ethnic backgrounds. Considering the very steep increase in females, this warrants further investigation about which females are particularly vulnerable and why. 


## **Homework**
*** 

Ask students to scrape tables 11.5A and 11.5B from the website which contain data about the receipt of treatment among youths who reported having a severe episode. Ask students to create plots and perform chi-square tests to evaluate how groups compare over time.


## **Additional Information**
***

### Helpful Links

[RStudio](https://rstudio.com/products/rstudio/features/){target="_blank"}  
[Cheatsheet on RStuido IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf){target="_blank"}  
[Other RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}   
[Tidyverse](https://www.tidyverse.org/){target="_blank"}   
[Selection bias](https://en.wikipedia.org/wiki/Selection_bias?oldformat=true){target="_blank"}  
[Sampling methods](https://en.wikipedia.org/wiki/Sampling_(statistics)?oldformat=true){target="_blank"}   
[Sampling frame](https://en.wikipedia.org/wiki/Sampling_frame?oldformat=true){target="_blank"}  
[DSM 5](https://en.wikipedia.org/wiki/DSM-5){target="_blank"}</summary>    
[National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm){target="_blank"}   
[Substance Abuse and Mental Health Services Administration (SAMHSA)](https://www.samhsa.gov/){target="_blank"}   
[U.S. Department of Health and Human Services (DHHS)](https://www.hhs.gov/){target="_blank"}   
[NSDUH Survey Results Website](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm){target="_blank"} (where we got the data for this case study)    
[Details about the Survey](https://nsduhweb.rti.org/respweb/about_nsduh.html){target="_blank"}  
[Report about the 2018 NSDUH Survey](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetailedTabs2018.pdf){target="_blank"}  
[Web scraping](https://en.wikipedia.org/wiki/Web_scraping?oldformat=true){target="_blank"}  
[Selectorgadget Tool](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html){target="_blank"}  
See this [blog post](http://research.libd.org/rstatsclub/post/introduction-to-scraping-and-wranging-tables-from-research-articles/#.Xw878ZNKhQJ){target="_blank"}, this [blog post](http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html){target="_blank"}, and this [vignette](https://rstudio-pubs-static.s3.amazonaws.com/266430_f3fd4660b2744751ab144aa130768a06.html){target="_blank"} for more information about web scraping  
[CSS selectors tutorial](http://flukeout.github.io/#){target="_blank"} (and the [answers](https://gist.github.com/chrisman/fcb0a88459cd98239dbe6d2d200b02d1){target="_blank"})     
[Piping in R](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}   
[Writing functions](https://r4ds.had.co.nz/functions.html){target="_blank"}
Also see [this case study](https://opencasestudies.github.io/ocs-bloomberg-vaping-case-study/){target="_blank"}  for more information on writing functions.   
[String manipulation cheatsheet](https://rstudio.com/resources/cheatsheets/){target="_blank"}  
[Table formats](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"}
[Pearson's chi-squared test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#:~:text=Pearson's%20chi%2Dsquared%20test%20is,differs%20from%20a%20theoretical%20distribution.){target="_blank"}  
[contingency table](https://en.wikipedia.org/wiki/Contingency_table){target="_blank"}  
[Chi-square distribution](https://en.wikipedia.org/wiki/Chi-squared_test#/media/File:Chi-square_distributionCDF-English.png){target="_blank"}  
[chi-square distribution applet](http://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html){target="_blank"}  
See here for a more thorough explanation of the [chi-square test](https://www.ling.upenn.edu/~clight/chisquared.htm){target="_blank"}   
[`ggplot2` package](http://ggplot2.tidyverse.org){target="_blank"}    
Please see [this case study](https://opencasestudies.github.io/ocs-bp-co2-emissions/){target="_blank"}  for more details on using `ggplot2`. 
[grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html){target="_blank"}  
[`ggplot2` themes](https://ggplot2.tidyverse.org/reference/ggtheme.html){target="_blank"}   
[`directlabels` package methods](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"}  
[Viridis palette for colorblind friendly plots](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html){target="_blank"}  
[Motivating article for this case study about depression rates](https://pubmed.ncbi.nlm.nih.gov/30869927/){target="_blank"} (Access is possible for those at Hopkins by using their email address)    

[Motivating article about the rate of youths seeking mental health services](https://pubmed.ncbi.nlm.nih.gov/24285382/){target="_blank"}    

[Cross-cultural review article about possible causes for increased depression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330161/){target="_blank"}

[Review article about social media and depression](https://childmind.org/article/is-social-media-use-causing-depression/){target="_blank"}


<u>**Packages used in this case study:** </u>

Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[rvest](https://github.com/tidyverse/rvest){target="_blank"}      | to scrape web pages  
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to scrape web pages  
[magrittr](https://magrittr.tidyverse.org/){target="_blank"}      | to scrape web pages  
[stringr](https://stringr.tidyverse.org/){target="_blank"}      | to manipulate strings  
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to change the shape or format of tibbles to wide and long  
[tibble](https://tibble.tidyverse.org/){target="_blank"}      | to create tibbles and convert a column to row names  
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}      | to create plots  
[directlabels](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"}      | to add labels directly to lines in plots  
[forcats](https://forcats.tidyverse.org/){target="_blank"}      | to reorder factor for plot  
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"}      | to combine plots together  

#### {.emphasis_block}
If you are in crisis and need help, call this toll-free number for the **National Suicide Prevention Lifeline (NSPL)**, available 24 hours a day, every day: **1-800-273-TALK (8255)**. The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at [www.suicidepreventionlifeline.org](www.suicidepreventionlifeline.org){target="_blank"}.

The **Crisis Text Line** is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to **741741** and a trained crisis counselor will respond to you with support and information over text message. Visit [www.crisistextline.org](www.crisistextline.org){target="_blank"}.

####

Also see [here](https://www.mhanational.org/depression-teens-0){target="_blank"} for more information about how to recognize and help youths experiencing symptoms of depression


### Acknowledgements

We would like to acknowledge [Tamar Mendelson](https://www.jhsph.edu/faculty/directory/profile/1770/tamar-mendelson){target="_blank"} for assisting in framing the major direction of the case study.

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/){target="_blank"} for funding this work. 




